Convert from unicode chars to HTML entities

I have a string containing Latin-1 characters:

s = u"© and many more..."

I want to convert it to HTML entities:

result =>
"&copy; and many more..."

Decimal/hex escapes would be acceptable:
"&#169; and many more..."
"&#xA9; and many more..."

I can look up tables of HTML entities on the web (they're a dime a
dozen), turn them into a dict mapping character to entity, then convert
the string by hand. Is there a "batteries included" solution that doesn't
involve reinventing the wheel?
--
Steven D'Aprano
Jan 29 '07 #1
Steven D'Aprano wrote:
> I have a string containing Latin-1 characters:
>
> s = u"© and many more..."
>
> I want to convert it to HTML entities:
>
> result =>
> "&copy; and many more..."
>
> Decimal/hex escapes would be acceptable:
> "&#169; and many more..."
> "&#xA9; and many more..."
>
> I can look up tables of HTML entities on the web (they're a dime a
> dozen), turn them into a dict mapping character to entity, then convert
> the string by hand. Is there a "batteries included" solution that doesn't
> involve reinventing the wheel?

It's *very* ugly, but I'm pretty sure you can make it look prettier.

import htmlentitydefs as entity

s = u"© and many more..."
t = ""
for i in s:
    if ord(i) in entity.codepoint2name:
        name = entity.codepoint2name.get(ord(i))
        entityCode = entity.name2codepoint.get(name)
        # a numeric character reference must end with ";"
        t += "&#" + str(entityCode) + ";"
    else:
        t += i
print t

Hope this helps.

Adonis
Jan 29 '07 #2
Adonis Vargas wrote:
> [...]
>
> It's *very* ugly, but I'm pretty sure you can make it look prettier.
>
> import htmlentitydefs as entity
>
> s = u"© and many more..."
> t = ""
> for i in s:
>     if ord(i) in entity.codepoint2name:
>         name = entity.codepoint2name.get(ord(i))
>         entityCode = entity.name2codepoint.get(name)
>         t += "&#" + str(entityCode) + ";"
>     else:
>         t += i
> print t
>
> Hope this helps.
>
> Adonis

or

import htmlentitydefs as entity

s = u"© and many more..."
t = u""
for i in s:
    if ord(i) in entity.codepoint2name:
        name = entity.codepoint2name.get(ord(i))
        t += "&" + name + ";"
    else:
        t += i
print t

Which I think is what you were looking for.

Adonis
Jan 29 '07 #3
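
For readers on Python 3, where `htmlentitydefs` was renamed `html.entities`, the loop above can be sketched roughly as follows. The function name `to_named_entities` is hypothetical, and falling back to a decimal character reference for non-ASCII characters that have no named entity is an assumption, not something the posts above specify.

```python
from html.entities import codepoint2name


def to_named_entities(text):
    """Replace characters that have an HTML 4 entity name with &name;."""
    parts = []
    for ch in text:
        code = ord(ch)
        if code in codepoint2name:
            # Known entity, e.g. 169 -> "copy"
            parts.append("&%s;" % codepoint2name[code])
        elif code > 127:
            # Assumption: no named entity, so emit a decimal reference
            parts.append("&#%d;" % code)
        else:
            parts.append(ch)
    return "".join(parts)


print(to_named_entities("\u00a9 and many more..."))  # &copy; and many more...
```

Collecting the pieces in a list and joining once avoids the repeated string concatenation of the original loop.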
On Mon, 29 Jan 2007 00:05:24 -0300, Steven D'Aprano
<st***@REMOVEME.cybersource.com.au> wrote:
> I have a string containing Latin-1 characters:
>
> s = u"© and many more..."
>
> I want to convert it to HTML entities:
>
> result =>
> "&copy; and many more..."

Module htmlentitydefs contains the tables you're looking for, but you need
a few transforms:

<code>
# -*- coding: iso-8859-15 -*-
from htmlentitydefs import codepoint2name

unichr2entity = dict((unichr(code), u'&%s;' % name)
                     for code, name in codepoint2name.iteritems()
                     if code != 38)  # exclude "&"

def htmlescape(text, d=unichr2entity):
    if u"&" in text:
        text = text.replace(u"&", u"&amp;")
    for key, value in d.iteritems():
        if key in text:
            text = text.replace(key, value)
    return text

print '%r' % htmlescape(u'hello')
print '%r' % htmlescape(u'"©® áé&ö <²³>')
</code>

Output:
u'hello'
u'&quot;&copy;&reg; &aacute;&eacute;&amp;&ouml; &lt;&sup2;&sup3;&gt;'

The result is a unicode object in which every character with a known
entity has been replaced. It does not handle characters that have no
entity defined - as the docs for htmlentitydefs say,
"the definition provided here contains all the entities defined by XHTML
1.0 that can be handled using simple textual substitution in the Latin-1
character set (ISO-8859-1)."

--
Gabriel Genellina

Jan 29 '07 #4
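
A single-pass variant of the same idea, sketched for Python 3 (an assumption; the post above targets Python 2). `str.translate` accepts a dict mapping code points to replacement strings, so "&" does not need to be excluded and handled first: each input character is looked up exactly once and the output is never rescanned. The name `ENTITY_TABLE` is illustrative.

```python
from html.entities import codepoint2name

# Map every code point that has an HTML 4 entity name to "&name;".
ENTITY_TABLE = {code: "&%s;" % name for code, name in codepoint2name.items()}


def htmlescape(text):
    # One pass over the input; characters without an entry are left as they are.
    return text.translate(ENTITY_TABLE)


print(htmlescape("hello"))
print(htmlescape('"\u00a9\u00ae \u00e1\u00e9&\u00f6 <\u00b2\u00b3>'))
# &quot;&copy;&reg; &aacute;&eacute;&amp;&ouml; &lt;&sup2;&sup3;&gt;
```

As with the original, characters that have no named entity pass through unchanged.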
Steven D'Aprano wrote:
> I have a string containing Latin-1 characters:
>
> s = u"© and many more..."
>
> I want to convert it to HTML entities:
>
> result =>
> "&copy; and many more..."
>
> Decimal/hex escapes would be acceptable:
> "&#169; and many more..."
> "&#xA9; and many more..."

>>> s = u"© and many more..."
>>> s.encode('ascii', 'xmlcharrefreplace')
'&#169; and many more...'
Jan 29 '07 #5
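
The same call in Python 3 (an assumption; the session above is Python 2) returns `bytes`, so a `decode` is needed to get text back. `xmlcharrefreplace` always emits decimal references and leaves ASCII characters such as `&` and `<` untouched, so the sketch below also shows `html.escape` for those, plus a hypothetical helper, `to_hex_refs`, for the hexadecimal form mentioned in the question.

```python
import html

s = "\u00a9 and many more..."

# encode() yields bytes in Python 3; decode to get a str again.
decimal = s.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(decimal)  # &#169; and many more...


def to_hex_refs(text):
    """Hypothetical helper: hexadecimal references for non-ASCII characters."""
    return "".join(ch if ord(ch) < 128 else "&#x%X;" % ord(ch) for ch in text)


print(to_hex_refs(s))  # &#xA9; and many more...

# Escape markup characters first, then replace the non-ASCII ones.
escaped = html.escape("a < b \u00a9").encode("ascii", "xmlcharrefreplace").decode("ascii")
print(escaped)  # a &lt; b &#169;
```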
On Sun, 28 Jan 2007 23:41:19 -0500, Leif K-Brooks wrote:
> >>> s = u"© and many more..."
> >>> s.encode('ascii', 'xmlcharrefreplace')
> '&#169; and many more...'

Wow. That's short and to the point. I like it.

A few issues:

(1) It doesn't seem to be reversible:

>>> '&#169; and many more...'.decode('latin-1')
u'&#169; and many more...'

What should I do instead?

(2) Are XML entities guaranteed to be the same as HTML entities?

(3) Is there a way to find out at runtime what encoders/decoders/error
handlers are available, and what they do?

Thanks,
--
Steven D'Aprano

Jan 29 '07 #6
Steven D'Aprano wrote:
> A few issues:
>
> (1) It doesn't seem to be reversible:
>
> >>> '&#169; and many more...'.decode('latin-1')
> u'&#169; and many more...'
>
> What should I do instead?

Unfortunately, there's nothing in the standard library that can do that,
as far as I know. You'll have to write your own function. Here's one
I've used before (partially stolen from code in Python patch #912410
which was written by Aaron Swartz):

from htmlentitydefs import name2codepoint
import re

def _replace_entity(m):
    s = m.group(1)
    if s[0] == u'#':
        s = s[1:]
        try:
            if s[0] in u'xX':
                c = int(s[1:], 16)
            else:
                c = int(s)
            return unichr(c)
        except ValueError:
            return m.group(0)
    else:
        try:
            return unichr(name2codepoint[s])
        except (ValueError, KeyError):
            return m.group(0)

_entity_re = re.compile(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));")

def unescape(s):
    return _entity_re.sub(_replace_entity, s)

> (2) Are XML entities guaranteed to be the same as HTML entities?

XML defines one entity which doesn't exist in HTML: &apos;. But
xmlcharrefreplace only generates numeric character references, and those
should be the same between XML and HTML.

> (3) Is there a way to find out at runtime what encoders/decoders/error
> handlers are available, and what they do?

From what I remember, that's not possible because the codec system is
designed so that functions taking names are registered instead of the
names themselves. But all of the standard codecs are documented at
<http://python.org/doc/current/lib/standard-encodings.html>, and all of
the standard error handlers are documented at
<http://python.org/doc/current/lib/codec-base-classes.html>.
Jan 29 '07 #7
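
On Python 3.4 and later (an assumption about the reader's environment; the post above predates it), the standard library does include this reverse step as `html.unescape`, which handles named, decimal, and hexadecimal references. A brief sketch:

```python
import html

print(html.unescape("&#169; and many more..."))

# Named, hexadecimal, and decimal references all resolve to the same character.
same = html.unescape("&copy; &#xA9; &#169;")
print(same == "\u00a9 \u00a9 \u00a9")

# Round trip with the xmlcharrefreplace error handler.
original = "\u00a9 and many more..."
encoded = original.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(html.unescape(encoded) == original)
```

Like the hand-written `unescape` above, it leaves references it does not recognise in place.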
Steven D'Aprano wrote:
> A few issues:
>
> (1) It doesn't seem to be reversible:
>
> >>> '&#169; and many more...'.decode('latin-1')
> u'&#169; and many more...'
>
> What should I do instead?

For reverse processing, you need to parse it with an
SGML/XML parser.

> (2) Are XML entities guaranteed to be the same as HTML entities?

Please distinguish between the terms "entity", "entity
reference", and "character reference".

An (external parsed) entity is a named piece of text, such
as the copyright character. An entity reference is a reference
to such a thing, e.g. &copy;

A character reference is a reference to a character, not to
an entity. xmlcharrefreplace generates character references,
not entity references (let alone generating entities). The
character references in XML and HTML both reference by
Unicode ordinal, so it is "the same".

> (3) Is there a way to find out at runtime what encoders/decoders/error
> handlers are available, and what they do?

Not through Python code. In C code, you can look at the
codec_error_registry field of the interpreter object.

Regards,
Martin
Jan 29 '07 #8
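
Although the registered names cannot be listed from Python code, a known name can at least be probed: `codecs.lookup_error` returns the handler registered under a name and raises `LookupError` otherwise, and `codecs.lookup` does the same for codecs. The sketch below is written for Python 3; the helper name `error_handler_exists` is hypothetical.

```python
import codecs


def error_handler_exists(name):
    """Return True if an error handler is registered under this name."""
    try:
        codecs.lookup_error(name)
        return True
    except LookupError:
        return False


for name in ("strict", "ignore", "replace",
             "xmlcharrefreplace", "backslashreplace", "no-such-handler"):
    print(name, error_handler_exists(name))

# Codecs can be probed the same way; an unknown name raises LookupError.
print(codecs.lookup("latin-1").name)
```

This only confirms that a name is registered; what each handler does still has to come from the documentation.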
Steven D'Aprano <st***@removeme.cybersource.com.au> wrote:
> I have a string containing Latin-1 characters:
>
> s = u"© and many more..."
>
> I want to convert it to HTML entities:
>
> result =>
> "&copy; and many more..."
> [...]
> Is there a "batteries included" solution that doesn't involve
> reinventing the wheel?

recode is good for this kind of thing:

$ recode latin1..html -d mytextfile

It seems that there are recode bindings for Python:

$ apt-cache search recode | grep python
python-bibtex - Python interfaces to BibTeX and the GNU Recode library

HTH, cheers.
--
Roberto Bonvallet
Feb 8 '07 #9
