codec for html/xml entities!?

Hi friends, I've been OFF-Python for quite a while now and am glad
to be back, at least as far as work permits.

Q:
What's a good way to encode and decode those entities like &euro; or
&#8364;?

I need isolated functions to process lines. Looking at the xml and
sgmllib stuff I didn't really get a clue as to what's the most pythonic
way. Are there library functions I didn't see?

FYI, here is what I hacked down and what will probably (hopefully...)
do the job.

Feel free to comment.

# -*- coding: iso-8859-1 -*-
"""\
entity_stuff.py, mb, 2008-03-14, 2008-03-18

"""

import htmlentitydefs
import re

# Non-greedy: each match runs from '&' to the next ';'.
RE_OBJ_entity = re.compile('(&.+?;)')

def entity2uc(entity):
    """Convert entity like &#123; to unichr.

    Return (result,True) on success or (input string, False)
    otherwise. Example:
        entity2uc('&euro;')   -> (u'\u20ac',True)
        entity2uc('&#8364;')  -> (u'\u20ac',True)
        entity2uc('&#x20ac;') -> (u'\u20ac',True)
        entity2uc('&foobar;') -> ('&foobar;',False)
    """

    gotCodepoint = False
    gotUnichr = False
    if entity.startswith('&#'):
        # Numeric reference: hex ('&#x...;') or decimal ('&#...;').
        if entity[2] == 'x':
            base = 16
            digits = entity[3:-1]
        else:
            base = 10
            digits = entity[2:-1]
        try:
            v = int(digits,base)
            gotCodepoint = True
        except ValueError:
            pass
    else:
        # Named entity: look the name up without '&' and ';'.
        v = htmlentitydefs.name2codepoint.get(entity[1:-1],None)
        if v is not None:
            gotCodepoint = True

    if gotCodepoint:
        try:
            v = unichr(v)
            gotUnichr = True
        except (ValueError, OverflowError):
            # Code point outside the range unichr() accepts.
            pass
    if gotUnichr:
        return v, gotUnichr
    else:
        return entity, gotUnichr

def line_entities_to_uc(line):
    result = []
    cntProblems = 0
    # split() with a capturing group keeps the entities in the list.
    for e in RE_OBJ_entity.split(line):
        if e.startswith('&'):
            e,success = entity2uc(e)
            if not success:
                cntProblems += 1
        result.append(e)
    return u''.join(result), cntProblems

def uc2entity(uc):
    cp = ord(uc)
    if cp > 127:
        name = htmlentitydefs.codepoint2name.get(cp,None)
        if name:
            result = '&%s;' % name
        else:
            result = '&#x%x;' % cp
    else:
        result = chr(cp)
    return result

def encode_line(line):
    return ''.join([uc2entity(u) for u in line])

if 1 and __name__=="__main__":
    import codecs
    infile = 'temp.ascii.xml'
    outfile = 'temp.utf8.xml'
    of = codecs.open(outfile,'wb','utf-8')
    totalProblems = 0
    totalLines = 0
    for line in file(infile,'rb'):
        line2, cntProblems = line_entities_to_uc(line)
        of.write(line2)
        totalLines += 1
        totalProblems += cntProblems
    of.close()
    print
    print "Summary:"
    print " Infile : %s" % (infile,)
    print " Outfile: %s" % (outfile,)
    print ' %8d %s %s' % (totalLines,
        ['lines','line'][totalLines==1], 'written.')
    print ' %8d %s %s' % (totalProblems,
        ['entities','entity'][totalProblems==1], 'left unconverted.')
    print '%s' % ('Done.',)

Have a nice day and
ru, Martin
(read you, ;-)
Jun 27 '08 #1
Martin Bless wrote:
> What's a good way to encode and decode those entities like &euro; or
> &#8364;?

Hmm, since you provide code, I'm not quite sure what your actual question is.

So I'll just comment on the code here.

> def entity2uc(entity):
>     """Convert entity like &#123; to unichr.
>
>     Return (result,True) on success or (input string, False)
>     otherwise. Example:
>         entity2uc('&euro;')   -> (u'\u20ac',True)
>         entity2uc('&#8364;')  -> (u'\u20ac',True)
>         entity2uc('&#x20ac;') -> (u'\u20ac',True)
>         entity2uc('&foobar;') -> ('&foobar;',False)
>     """

Is there a reason why you return a tuple instead of just returning the
converted result and raising an exception if the conversion fails?

Stefan
Jun 27 '08 #2
[Stefan Behnel] wrote:
> Martin Bless wrote:
>> What's a good way to encode and decode those entities like &euro; or
>> &#8364;?
>
> Hmm, since you provide code, I'm not quite sure what your actual question is.

- What's a GOOD way?
- Am I reinventing the wheel?
- Are there well tested, fast, state of the art, builtin ways?
- Is something like line.decode('htmlentities') out there?
- Am I in conformity with relevant RFCs? (I'm hoping so ...)

> So I'll just comment on the code here.

>> def entity2uc(entity):
>>     """Convert entity like &#123; to unichr.
>>
>>     Return (result,True) on success or (input string, False)
>>     otherwise. Example:
>>         entity2uc('&euro;')   -> (u'\u20ac',True)
>>         entity2uc('&#8364;')  -> (u'\u20ac',True)
>>         entity2uc('&#x20ac;') -> (u'\u20ac',True)
>>         entity2uc('&foobar;') -> ('&foobar;',False)
>>     """
>
> Is there a reason why you return a tuple instead of just returning the
> converted result and raising an exception if the conversion fails?

Mainly a matter of style. When I use the function in the future,
this way makes it unambiguously clear that there might have been
unconverted entities. But I don't have to deal with the details of how
this has been discovered. And maybe I'd like to change the algorithm
in the future? This way it's nicely encapsulated.

Have a nice day

Martin
Jun 27 '08 #3
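
On the question of whether library functions already exist: a minimal sketch, assuming Python 3.4 or later (the thread's own code is Python 2, where the module names differ). `html.unescape` decodes named, decimal and hex references, and the `xmlcharrefreplace` error handler covers the encoding direction with decimal references. The function names `decode_line`, `encode_numeric` and `encode_named` are hypothetical and chosen only for this illustration.

```python
import html
from html.entities import codepoint2name


def decode_line(line):
    """Replace named and numeric character references with their characters."""
    return html.unescape(line)


def encode_numeric(line):
    """Replace every non-ASCII character with a decimal reference."""
    return line.encode("ascii", "xmlcharrefreplace").decode("ascii")


def encode_named(line):
    """Hypothetical helper: prefer a named entity, else a hex reference."""
    out = []
    for ch in line:
        cp = ord(ch)
        if cp > 127:
            name = codepoint2name.get(cp)
            out.append("&%s;" % name if name else "&#x%x;" % cp)
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(decode_line("&euro; &#8364; &#x20ac; &foobar;"))
    print(encode_numeric("price: \u20ac5"))
    print(encode_named("price: \u20ac5"))
```

Neither encoder touches `&`, `<` or `>`; `html.escape` handles those separately. Unknown names such as `&foobar;` pass through `html.unescape` unchanged, so this route does not report how many entities were left unconverted.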
Martin Bless wrote:
> [Stefan Behnel] wrote:
>> def entity2uc(entity):
>>     """Convert entity like &#123; to unichr.
>>
>>     Return (result,True) on success or (input string, False)
>>     otherwise. Example:
>>         entity2uc('&euro;')   -> (u'\u20ac',True)
>>         entity2uc('&#8364;')  -> (u'\u20ac',True)
>>         entity2uc('&#x20ac;') -> (u'\u20ac',True)
>>         entity2uc('&foobar;') -> ('&foobar;',False)
>>     """
>>
>> Is there a reason why you return a tuple instead of just returning the
>> converted result and raising an exception if the conversion fails?
>
> Mainly a matter of style. When I use the function in the future,
> this way makes it unambiguously clear that there might have been
> unconverted entities. But I don't have to deal with the details of how
> this has been discovered. And maybe I'd like to change the algorithm
> in the future? This way it's nicely encapsulated.

The normal case is that the entity can be replaced. A failed conversion is
the exceptional case, and then the caller has to deal with the problem in one
way or another. You are making the normal case more complicated, as the caller
*always* has to check the result indicator to see if the return value is the
expected result or something different. I don't think there is any reason to
require that, except when the conversion really failed.

Stefan
Jun 27 '08 #4
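
The exception-based interface Stefan describes could be sketched as follows, assuming Python 3 (the thread's own code is Python 2). The names `EntityError`, `entity_to_char` and `line_entities_to_text` are hypothetical and not from the thread. The converter returns only the character and raises on failure, so just the caller that tolerates failures needs extra code.

```python
import re
from html.entities import name2codepoint

ENTITY_RE = re.compile(r"&.+?;")


class EntityError(ValueError):
    """Hypothetical exception for an entity that cannot be converted."""


def entity_to_char(entity):
    """Return the character for '&name;', '&#N;' or '&#xH;'.

    Raises EntityError if the name is unknown, the digits are invalid,
    or the code point is out of range.
    """
    body = entity[1:-1]  # strip '&' and ';'
    try:
        if body[:2] in ("#x", "#X"):
            return chr(int(body[2:], 16))
        if body.startswith("#"):
            return chr(int(body[1:], 10))
        return chr(name2codepoint[body])
    except (KeyError, ValueError, OverflowError):
        raise EntityError(entity) from None


def line_entities_to_text(line):
    """Convert what can be converted; keep the rest and count the failures."""
    problems = 0

    def replace(match):
        nonlocal problems
        try:
            return entity_to_char(match.group(0))
        except EntityError:
            problems += 1
            return match.group(0)

    return ENTITY_RE.sub(replace, line), problems


if __name__ == "__main__":
    print(line_entities_to_text("a &euro; b &foobar;"))
```

A caller converting a single entity it expects to be valid just writes `entity_to_char('&euro;')`. The line-level function still reports a problem count, so Martin's summary output would remain possible with this interface.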
