
converting html escape sequences to unicode characters

harrelson
#1: Jul 18 '05
I have a list of about 2500 html escape sequences (decimal) that I need
to convert to utf-8. Stuff like:
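
&#48708;
&#54665;
&#44592;
&#47196;
&#48372;
&#45244;
&#44144;
&#50640;
&#50836;
&#45236;
&#47732;
&#44552;
&#51060;
&#50620;
&#47560;
&#51648;
&#51104;
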
Anyone know what the decimal is representing? It doesn't seem to
equate to a unicode codepoint...

culley


Kent Johnson
#2: Jul 18 '05


harrelson wrote:
> I have a list of about 2500 html escape sequences (decimal) that I need
> to convert to utf-8. Stuff like:
>
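> &#48708;
> &#54665;
> &#44592;
> &#47196;
> &#48372;
> &#45244;
> &#44144;
> &#50640;
> &#50836;
> &#45236;
> &#47732;
> &#44552;
> &#51060;
> &#50620;
> &#47560;
> &#51648;
> &#51104;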
>
> Anyone know what the decimal is representing? It doesn't seem to
> equate to a unicode codepoint...

In well-formed HTML (!) these should be the decimal values of Unicode characters. See
http://www.w3.org/TR/html4/charset.html#h-5.3.1

These characters appear to be Hangul Syllables:
http://www.unicode.org/charts/PDF/UAC00.pdf

import unicodedata

nums = [
48708,
54665,
44592,
47196,
48372,
45244,
44144,
50640,
50836,
45236,
47732,
44552,
51060,
50620,
47560,
51648,
51104,
]

# Python 2: unichr() turns a code point into a one-character unicode string
for num in nums:
    print num, unicodedata.name(unichr(num), 'Unknown')

=>
48708 HANGUL SYLLABLE BI
54665 HANGUL SYLLABLE HAENG
44592 HANGUL SYLLABLE GI
47196 HANGUL SYLLABLE RO
48372 HANGUL SYLLABLE BO
45244 HANGUL SYLLABLE NAEL
44144 HANGUL SYLLABLE GEO
50640 HANGUL SYLLABLE E
50836 HANGUL SYLLABLE YO
45236 HANGUL SYLLABLE NAE
47732 HANGUL SYLLABLE MYEON
44552 HANGUL SYLLABLE GEUM
51060 HANGUL SYLLABLE I
50620 HANGUL SYLLABLE EOL
47560 HANGUL SYLLABLE MA
51648 HANGUL SYLLABLE JI
51104 HANGUL SYLLABLE JAM
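
For anyone reading this on Python 3, here is a minimal sketch of going from the raw escape sequences to UTF-8 bytes. It assumes the input only contains decimal references of the form `&#NNNNN;` and that `chr()` stands in for Python 2's `unichr()`. The names `DECIMAL_REF` and `decode_decimal_refs` are made up for illustration.

```python
import re

# Matches a decimal numeric character reference such as "&#48708;".
DECIMAL_REF = re.compile(r"&#(\d+);")

def decode_decimal_refs(text):
    """Replace each decimal reference in text with the character it names."""
    return DECIMAL_REF.sub(lambda m: chr(int(m.group(1))), text)

refs = "&#48708;&#54665;&#44592;"
decoded = decode_decimal_refs(refs)     # a str holding three Hangul syllables
utf8_bytes = decoded.encode("utf-8")    # Hangul syllables take three bytes each in UTF-8
print(decoded, len(utf8_bytes))
```

Text that is not a decimal reference passes through unchanged, so the same function can be run over a whole file's contents before writing it back out as UTF-8.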

Kent
Craig Ringer
#3: Jul 18 '05


On Fri, 2004-12-10 at 08:36, harrelson wrote:
> I have a list of about 2500 html escape sequences (decimal) that I need
> to convert to utf-8. Stuff like:

I'm pretty sure this somewhat horrifying code does it, but is probably
an example of what not to do:
>>> escapeseq = '&#48708;'
>>> uescape = ("\\u%x" % int(escapeseq[2:-1])).decode("unicode_escape")
>>> uescape
u'\ube44'
>>> print uescape

(I don't seem to have the font for it, but I think that's right - my
terminal font seems to show it correctly).

I just get the decimal value of the escape, format it as a Python
unicode hex escape sequence, and tell Python to interpret it as an
escaped unicode string.
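
A rough Python 3 rendering of those three steps, as a sketch only. The function name `unescape_via_escape` is hypothetical, and the `%04x` padding is an added assumption: a `\u` escape needs exactly four hex digits, so this only handles code points up to U+FFFF.

```python
def unescape_via_escape(escapeseq):
    # Step 1: strip the leading "&#" and trailing ";" to get the decimal value.
    codepoint = int(escapeseq[2:-1])
    # Step 2: format it as the text of a Python "\uXXXX" escape (four hex digits).
    literal = "\\u%04x" % codepoint
    # Step 3: let the unicode_escape codec interpret that text as an escape.
    return literal.encode("ascii").decode("unicode_escape")

# The round trip agrees with converting the code point directly.
print(unescape_via_escape("&#48708;") == chr(48708))
```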
>>> entities = ['&#48708;', '&#54665;', '&#44592;', '&#47196;',
...             '&#48372;', '&#45244;', '&#44144;', '&#50640;', '&#50836;', '&#45236;',
...             '&#47732;', '&#44552;', '&#51060;', '&#50620;', '&#47560;', '&#51648;',
...             '&#51104;']
>>> def unescape(escapeseq):
...     return ("\\u%x" % int(escapeseq[2:-1])).decode("unicode_escape")
...
>>> print ' '.join([ unescape(x) for x in entities ])
비 행 기 로 보 낼 거 에 요 내 면 금 이 얼 마 지 잠

--
Craig Ringer

Craig Ringer
#4: Jul 18 '05


On Fri, 2004-12-10 at 16:09, Craig Ringer wrote:
> On Fri, 2004-12-10 at 08:36, harrelson wrote:
> > I have a list of about 2500 html escape sequences (decimal) that I need
> > to convert to utf-8. Stuff like:
>
> I'm pretty sure this somewhat horrifying code does it, but is probably
> an example of what not to do:

It is. Sorry. I initially misread Kent Johnson's post. He just used
'unichr()'. Colour me an idiot. If you ever need to know the hard way to
build a unicode character...
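
For completeness, on Python 3 the easy way may not even need a manual conversion: the standard library's `html.unescape()` decodes numeric character references directly. A small sketch, assuming Python 3.4 or later and a shortened sample list (the variable names are illustrative):

```python
import html

# Three of the decimal references from the original question.
entities = ["&#48708;", "&#54665;", "&#44592;"]

# html.unescape() turns each reference into the character it names.
text = "".join(html.unescape(e) for e in entities)
print(text)
```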

--
Craig Ringer
