I'm working on a script to download and parse a web page, and it
includes xml symbol notation, such as ' for the ' character. Does
anyone know of a pre-existing python script/lib to convert the xml
notation back to the actual symbol it represents?
If you have this given in an XML file (rather than an HTML file which
is not well-formed XML), you could use an XML parser for the entire
file. This would automatically unescape character references. Likewise,
you can parse it with HTMLParser, which will invoke the handle_charref
method for these.
If you just want to unescape references, you can use the code in
http://effbot.org/zone/re-sub.htm
HTH,
Martin