converting html escape sequences to unicode characters

I have a list of about 2500 html escape sequences (decimal) that I need
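to convert to utf-8. Stuff like:

&#48708;
&#54665;
&#44592;
&#47196;
&#48372;
&#45244;
&#44144;
&#50640;
&#50836;
&#45236;
&#47732;
&#44552;
&#51060;
&#50620;
&#47560;
&#51648;
&#51104;
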
Anyone know what the decimal is representing? It doesn't seem to
equate to a unicode codepoint...

culley

Jul 18 '05 #1
harrelson wrote:
I have a list of about 2500 html escape sequences (decimal) that I need
to convert to utf-8. Stuff like:

Anyone know what the decimal is representing? It doesn't seem to
equate to a unicode codepoint...


In well-formed HTML (!) these should be the decimal values of Unicode characters. See
http://www.w3.org/TR/html4/charset.html#h-5.3.1

These characters appear to be Hangul Syllables:
http://www.unicode.org/charts/PDF/UAC00.pdf

import unicodedata

nums = [
48708,
54665,
44592,
47196,
48372,
45244,
44144,
50640,
50836,
45236,
47732,
44552,
51060,
50620,
47560,
51648,
51104,
]

for num in nums:
    print num, unicodedata.name(unichr(num), 'Unknown')

=>
48708 HANGUL SYLLABLE BI
54665 HANGUL SYLLABLE HAENG
44592 HANGUL SYLLABLE GI
47196 HANGUL SYLLABLE RO
48372 HANGUL SYLLABLE BO
45244 HANGUL SYLLABLE NAEL
44144 HANGUL SYLLABLE GEO
50640 HANGUL SYLLABLE E
50836 HANGUL SYLLABLE YO
45236 HANGUL SYLLABLE NAE
47732 HANGUL SYLLABLE MYEON
44552 HANGUL SYLLABLE GEUM
51060 HANGUL SYLLABLE I
50620 HANGUL SYLLABLE EOL
47560 HANGUL SYLLABLE MA
51648 HANGUL SYLLABLE JI
51104 HANGUL SYLLABLE JAM
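
For anyone on Python 3, where `unichr()` no longer exists and `chr()` covers the full Unicode range, the same idea can be sketched roughly as follows. This is only an illustration and not part of the original reply: the names `DECIMAL_REF` and `decode_decimal_refs` are made up here, and the input is assumed to contain only well-formed decimal references of the form `&#NNNNN;`.

```python
import re

# Matches decimal numeric character references such as "&#48708;".
DECIMAL_REF = re.compile(r"&#(\d+);")

def decode_decimal_refs(text):
    """Replace each decimal reference with the character at that code point."""
    # chr() raises ValueError for values above 0x10FFFF; that case is not handled here.
    return DECIMAL_REF.sub(lambda m: chr(int(m.group(1))), text)

refs = "&#48708;&#54665;&#44592;"      # first three values from the list above
decoded = decode_decimal_refs(refs)    # a Python 3 str (Unicode text)
utf8_bytes = decoded.encode("utf-8")   # UTF-8 bytes, as asked for in the question
print(decoded, utf8_bytes)
```

Text outside the references passes through untouched, so the same function should work on a whole file's contents as well as on a bare list of escapes.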

Kent
Jul 18 '05 #2
On Fri, 2004-12-10 at 08:36, harrelson wrote:
I have a list of about 2500 html escape sequences (decimal) that I need
to convert to utf-8. Stuff like:


I'm pretty sure this somewhat horrifying code does it, but is probably
an example of what not to do:
>>> escapeseq = '&#48708;'
>>> uescape = ("\\u%x" % int(escapeseq[2:-1])).decode("unicode_escape")
>>> uescape
u'\ube44'
>>> print uescape
비
(I don't seem to have the font for it, but I think that's right - my
terminal font seems to show it correctly).

I just get the decimal value of the escape, format it as a Python
unicode hex escape sequence, and tell Python to interpret it as an
escaped unicode string.
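>>> entities = ['&#48708;', '&#54665;', '&#44592;', '&#47196;', '&#48372;', '&#45244;', '&#44144;', '&#50640;', '&#50836;', '&#45236;',
... '&#47732;', '&#44552;', '&#51060;', '&#50620;', '&#47560;', '&#51648;',
... '&#51104;']
>>> def unescape(escapeseq):
...     return ("\\u%x" % int(escapeseq[2:-1])).decode("unicode_escape")
...
>>> print ' '.join([ unescape(x) for x in entities ])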

비 행 기 로 보 낼 거 에 요 내 면 금 이 얼 마 지 *
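
One reason the escape-string route above is fragile: a `\u` escape needs exactly four hex digits, so the unpadded `"\\u%x"` format only works for code points from 0x1000 to 0xFFFF. The sketch below is a hypothetical Python 3 port (the function names `via_escape` and `naive_escape` are invented for illustration) showing the padded variant next to the unpadded one, which fails on a small code point such as 233.

```python
def via_escape(codepoint):
    """Build the character by decoding a zero-padded \\uXXXX escape (BMP only)."""
    if codepoint > 0xFFFF:
        raise ValueError("a \\u escape holds only four hex digits")
    return ("\\u%04x" % codepoint).encode("ascii").decode("unicode_escape")

def naive_escape(codepoint):
    """Python 3 port of the unpadded "\\u%x" formatting used in the post above."""
    return ("\\u%x" % codepoint).encode("ascii").decode("unicode_escape")

print(via_escape(48708) == chr(48708))   # the padded escape agrees with chr()
try:
    naive_escape(233)                    # only two hex digits: a truncated escape
except UnicodeDecodeError as exc:
    print("unpadded escape failed:", exc)
```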

--
Craig Ringer

Jul 18 '05 #3
On Fri, 2004-12-10 at 16:09, Craig Ringer wrote:
On Fri, 2004-12-10 at 08:36, harrelson wrote:
I have a list of about 2500 html escape sequences (decimal) that I need
to convert to utf-8. Stuff like:


I'm pretty sure this somewhat horrifying code does it, but is probably
an example of what not to do:


It is. Sorry. I initially misread Kent Johnson's post. He just used
'unichr()'. Colour me an idiot. If you ever need to know the hard way to
build a unicode character...
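
For comparison, the easy way looks something like the sketch below, written for Python 3 (where `chr()` replaces `unichr()`). The helper name `unescape_one` is invented for this example; `html.unescape` is the standard-library function available since Python 3.4, which also understands hexadecimal and named references.

```python
import html

def unescape_one(ref):
    """Convert a single decimal reference like '&#48708;' with chr()."""
    return chr(int(ref[2:-1]))   # strip the leading '&#' and the trailing ';'

entities = ["&#48708;", "&#54665;", "&#44592;"]
by_hand = "".join(unescape_one(e) for e in entities)
by_stdlib = html.unescape("".join(entities))   # handles the whole string at once
print(by_hand == by_stdlib)
```

The hand-rolled slice assumes every item is exactly one decimal reference; for mixed text, `html.unescape` is probably the safer choice.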

--
Craig Ringer

Jul 18 '05 #4
