xHTML/XML to Unicode (and back)

Hey guys

I'm currently screenscraping some Swedish site, and I need a method to
convert XML entities (&amp; etc, plus &#100; etc) to Unicode characters.
I'm sure one of Python's myriad of XML processors can do this but I can't
find which one.

Can anyone make any suggestions?

Thanks

-Rob
Jan 24 '06 #1
Robin Haswell wrote:
> I'm currently screenscraping some Swedish site, and I need a method to
> convert XML entities (&amp; etc, plus &#100; etc) to Unicode characters.
> I'm sure one of Python's myriad of XML processors can do this but I can't
> find which one.
>
> Can anyone make any suggestions?


any decent html-aware screen scraper library should be able to do
this for you.

if you've already extracted the strings, the strip_html function on
this page might be what you need:

http://effbot.org/zone/re-sub.htm#strip-html

</F>
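
For strings that have already been pulled out of the page, the substitution itself can be sketched with one regular expression and the standard entity table. This is a minimal Python 3 sketch, not the strip_html function from the linked page; the name unescape_entities is made up for illustration, and unknown or malformed entities are simply left as they are.

```python
import re
from html.entities import name2codepoint

# Matches &#xHH; (hex), &#NNN; (decimal) and &name; (named) references.
_ENTITY_RE = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def unescape_entities(text):
    """Replace character references in text with the characters they name."""

    def fixup(match):
        body = match.group(1)
        try:
            if body[:2].lower() == "#x":
                return chr(int(body[2:], 16))    # hexadecimal reference
            if body.startswith("#"):
                return chr(int(body[1:]))        # decimal reference
            return chr(name2codepoint[body])     # named entity
        except (KeyError, ValueError, OverflowError):
            return match.group(0)                # unknown: leave untouched

    return _ENTITY_RE.sub(fixup, text)


print(repr(unescape_entities("Fran&ccedil;ais &amp; &#100; &#xE5;")))
```

Because the function works on plain strings, it fits after either a parser or a regex-based extraction step.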

Jan 24 '06 #2
On Tue, 24 Jan 2006 14:46:46 +0100, Fredrik Lundh wrote:
> Robin Haswell wrote:
>> I'm currently screenscraping some Swedish site, and I need a method to
>> convert XML entities (&amp; etc, plus &#100; etc) to Unicode characters.
>> I'm sure one of Python's myriad of XML processors can do this but I can't
>> find which one.
>>
>> Can anyone make any suggestions?
>
> any decent html-aware screen scraper library should be able to do
> this for you.


I'm using BeautifulSoup and it appears that it doesn't. I'd also like to
know the answer to this for when I do screenscraping with regular
expressions :-)

Thanks
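
One regex-free way to cover both directions in the thread title, assuming a current Python 3: html.unescape was only added in Python 3.4, so it was not an option when this thread was written. The two function names below are hypothetical wrappers for illustration, and the sample string is made up.

```python
import html


def entities_to_unicode(text):
    """Turn &amp;, &ouml;, &#246;, &#xf6; and similar into the characters they name."""
    return html.unescape(text)


def unicode_to_entities(text):
    """Escape &, < and >, then write every non-ASCII character as &#NNN;."""
    escaped = html.escape(text, quote=False)
    # The xmlcharrefreplace error handler emits decimal character references.
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


decoded = entities_to_unicode("Sm&ouml;rg&aring;sbord &amp; kaffe")
print(repr(decoded))
print(unicode_to_entities(decoded))
```

The "back" direction produces numeric references rather than named ones, which keeps the output valid in plain XML as well as in HTML.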

> if you've already extracted the strings, the strip_html function on
> this page might be what you need:
>
> http://effbot.org/zone/re-sub.htm#strip-html
>
> </F>


Jan 24 '06 #3
Robin Haswell wrote:
> On Tue, 24 Jan 2006 14:46:46 +0100, Fredrik Lundh wrote:
>> Robin Haswell wrote:
>>> I'm currently screenscraping some Swedish site, and I need a method to
>>> convert XML entities (&amp; etc, plus &#100; etc) to Unicode characters.
>>> I'm sure one of Python's myriad of XML processors can do this but I can't
>>> find which one.
>>>
>>> Can anyone make any suggestions?
>>
>> any decent html-aware screen scraper library should be able to do
>> this for you.

And if it's really XHTML/XML, why not just use an XML parser? ;-)

> I'm using BeautifulSoup and it appears that it doesn't. I'd also like to
> know the answer to this for when I do screenscraping with regular
> expressions :-)


Anyway, on the subject of XML parsers, here's something to try out:

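import libxml2dom
import urllib

# Fetch the raw page bytes (urllib.urlopen is the Python 2 spelling).
f = urllib.urlopen("http://www.sweden.se/") # some Swedish site!
s = f.read()
f.close()

# html=1 selects the forgiving HTML parser instead of the strict XML one.
d = libxml2dom.parseString(s, html=1)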

Here, we assume that the site isn't well-formed XML and must be treated
as HTML, which libxml2 seems to be fairly good at doing. Then...

for a in d.xpath("//a"):
    print repr(a.getAttribute("href")), \
          repr(a.getAttribute("title")), \
          repr(a.nodeValue)

Here, we print out some of the hyperlinks in the page using repr to
show what the strings look like (and in a way that doesn't require you
to encode them for your terminal). On the above Swedish site, you'll
see some things like this:

u'Fran\xe7ais'

What's interesting is that in some cases such strings may have been
encoded using entities (such as in the title attributes), whereas in
other cases they may have been encoded using UTF-8 byte sequences (such
as in the link texts). The nice thing is that libxml2 just works it out
on your behalf.
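
The same effect can be sketched with only the Python 3 standard library. This uses html.parser.HTMLParser, not libxml2dom, and it assumes the bytes are known to be UTF-8, whereas libxml2 works the encoding out for itself. The LinkCollector class and the sample markup are made up for illustration: the title attribute carries an entity and the link text carries raw UTF-8 bytes.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href, title and text for every <a> element."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode references in text.
        super().__init__(convert_charrefs=True)
        self.links = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = dict(attrs)  # attribute values arrive already unescaped
            self._current = {
                "href": attrs.get("href"),
                "title": attrs.get("title"),
                "text": "",
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.links.append(self._current)
            self._current = None


# Hypothetical sample: entity in the attribute, UTF-8 bytes in the text.
raw = b'<a href="/fr/" title="Fran&ccedil;ais">Fran\xc3\xa7ais</a>'

parser = LinkCollector()
parser.feed(raw.decode("utf-8"))
parser.close()

for link in parser.links:
    print(repr(link["href"]), repr(link["title"]), repr(link["text"]))
```

Both the title and the text come out as the same Unicode string, whichever way the page happened to encode them.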

So there's no compelling need for regular expressions, but I'm sure
Fredrik will offer some alternative suggestions... and possibly some
good Swedish links, too. ;-)

Paul

Jan 24 '06 #4
