converting html escape sequences to unicode characters

I have a list of about 2500 html escape sequences (decimal) that I need
to convert to utf-8. Stuff like:

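&#48708;
&#54665;
&#44592;
&#47196;
&#48372;
&#45244;
&#44144;
&#50640;
&#50836;
&#45236;
&#47732;
&#44552;
&#51060;
&#50620;
&#47560;
&#51648;
&#51104;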

Anyone know what the decimal is representing? It doesn't seem to
equate to a unicode codepoint...

culley

Jul 18 '05 #1
harrelson wrote:
> I have a list of about 2500 html escape sequences (decimal) that I need
> to convert to utf-8. Stuff like:
> [...]
> Anyone know what the decimal is representing? It doesn't seem to
> equate to a unicode codepoint...


In well-formed HTML (!) these should be the decimal values of Unicode characters. See
http://www.w3.org/TR/html4/charset.html#h-5.3.1

These characters appear to be Hangul Syllables:
http://www.unicode.org/charts/PDF/UAC00.pdf

import unicodedata

nums = [
48708,
54665,
44592,
47196,
48372,
45244,
44144,
50640,
50836,
45236,
47732,
44552,
51060,
50620,
47560,
51648,
51104,
]

for num in nums:
    # Python 2: unichr() turns a code point number into a one-character unicode string
    print num, unicodedata.name(unichr(num), 'Unknown')

=>
48708 HANGUL SYLLABLE BI
54665 HANGUL SYLLABLE HAENG
44592 HANGUL SYLLABLE GI
47196 HANGUL SYLLABLE RO
48372 HANGUL SYLLABLE BO
45244 HANGUL SYLLABLE NAEL
44144 HANGUL SYLLABLE GEO
50640 HANGUL SYLLABLE E
50836 HANGUL SYLLABLE YO
45236 HANGUL SYLLABLE NAE
47732 HANGUL SYLLABLE MYEON
44552 HANGUL SYLLABLE GEUM
51060 HANGUL SYLLABLE I
50620 HANGUL SYLLABLE EOL
47560 HANGUL SYLLABLE MA
51648 HANGUL SYLLABLE JI
51104 HANGUL SYLLABLE JAM
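
For readers on Python 3, where `unichr()` no longer exists (`chr()` covers all of Unicode) and `print` is a function, the same lookup might be sketched as below. The helper name `describe` is made up for illustration; it also returns the UTF-8 bytes, since that was the stated goal of the original question.

```python
import unicodedata

def describe(num):
    """Return (character, Unicode name, UTF-8 bytes) for a decimal code point."""
    ch = chr(num)  # Python 3 equivalent of Python 2's unichr()
    return ch, unicodedata.name(ch, 'Unknown'), ch.encode('utf-8')

# First three values from the list above
for num in [48708, 54665, 44592]:
    ch, name, utf8 = describe(num)
    print(num, name, utf8)
```

Passing a default to `unicodedata.name()` keeps unnamed code points from raising `ValueError`, as in the Python 2 version.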

Kent
Jul 18 '05 #2
On Fri, 2004-12-10 at 08:36, harrelson wrote:
> I have a list of about 2500 html escape sequences (decimal) that I need
> to convert to utf-8. Stuff like:

I'm pretty sure this somewhat horrifying code does it, but is probably
an example of what not to do:
>>> escapeseq = '&#48708;'
>>> uescape = ("\\u%x" % int(escapeseq[2:-1])).decode("unicode_escape")
>>> uescape
u'\ube44'
>>> print uescape
비
(I don't seem to have the font for it, but I think that's right - my
terminal font seems to show it correctly).

I just get the decimal value of the escape, format it as a Python
unicode hex escape sequence, and tell Python to interpret it as an
escaped unicode string.
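>>> entities = ['&#48708;', '&#54665;', '&#44592;', '&#47196;', '&#48372;', '&#45244;', '&#44144;', '&#50640;', '&#50836;', '&#45236;',
... '&#47732;', '&#44552;', '&#51060;', '&#50620;', '&#47560;', '&#51648;',
... '&#51104;']
>>> def unescape(escapeseq):
...     return ("\\u%x" % int(escapeseq[2:-1])).decode("unicode_escape")
...
>>> print ' '.join([ unescape(x) for x in entities ])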

비 행 기 로 보 낼 거 에 요 내 면 금 이 얼 마 지 *
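
On Python 3 the `unicode_escape` round trip is not needed, and `str` has no `.decode()` method there anyway. A minimal sketch, assuming the input is text containing decimal references such as `&#48708;`: the standard library's `html.unescape()` already handles them, and the small regex version is included only to make the mechanics explicit. The function name `unescape_decimal` is hypothetical.

```python
import html
import re

def unescape_decimal(text):
    """Replace decimal references such as '&#48708;' with their characters."""
    return re.sub(r'&#(\d+);', lambda m: chr(int(m.group(1))), text)

entities = ['&#48708;', '&#54665;', '&#44592;']
joined = ''.join(entities)

decoded = unescape_decimal(joined)
assert decoded == html.unescape(joined)  # the stdlib helper gives the same text

utf8_bytes = decoded.encode('utf-8')  # the UTF-8 form the original question asked for
print(decoded, utf8_bytes)
```

For real input `html.unescape()` is probably the safer choice, since it also covers hexadecimal references (`&#xBE44;`) and named entities, which the regex above deliberately ignores.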

--
Craig Ringer

Jul 18 '05 #3
On Fri, 2004-12-10 at 16:09, Craig Ringer wrote:
> On Fri, 2004-12-10 at 08:36, harrelson wrote:
>> I have a list of about 2500 html escape sequences (decimal) that I need
>> to convert to utf-8. Stuff like:
>
> I'm pretty sure this somewhat horrifying code does it, but is probably
> an example of what not to do:

It is. Sorry. I initially misread Kent Johnson's post. He just used
'unichr()'. Colour me an idiot. If you ever need to know the hard way to
build a unicode character...

--
Craig Ringer

Jul 18 '05 #4
