xHTML/XML to Unicode (and back)

Hey guys

I'm currently screenscraping some Swedish site, and I need a method to
convert XML entities (&amp; etc, plus &#100; etc) to Unicode characters.
I'm sure one of Python's myriad of XML processors can do this but I can't
find which one.

Can anyone make any suggestions?

Thanks

-Rob
Jan 24 '06 #1
Robin Haswell wrote:
> I'm currently screenscraping some Swedish site, and I need a method to
> convert XML entities (&amp; etc, plus &#100; etc) to Unicode characters.
> I'm sure one of Python's myriad of XML processors can do this but I can't
> find which one.
>
> Can anyone make any suggestions?


any decent html-aware screen scraper library should be able to do
this for you.

if you've already extracted the strings, the strip_html function on
this page might be what you need:

http://effbot.org/zone/re-sub.htm#strip-html
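
For strings that have already been extracted, the conversion can be sketched with a regular-expression substitution and a callback. This is a minimal illustration assuming modern Python 3 and the standard library's `html.entities` table. It is not the code from the page linked above, and the name `unescape_entities` is hypothetical.

```python
import re
from html.entities import name2codepoint


def unescape_entities(text):
    """Replace &name;, &#NNN; and &#xHH; references with Unicode characters.

    Hypothetical helper for illustration; unknown or malformed references
    are left untouched.
    """
    def fixup(match):
        ref = match.group(1)
        try:
            if ref.startswith(("#x", "#X")):
                return chr(int(ref[2:], 16))   # hexadecimal character reference
            if ref.startswith("#"):
                return chr(int(ref[1:]))       # decimal character reference
            return chr(name2codepoint[ref])    # named entity such as &ouml;
        except (KeyError, ValueError, OverflowError):
            return match.group(0)              # not recognised: keep as-is

    return re.sub(r"&(#?\w+);", fixup, text)


print(unescape_entities("G&ouml;teborg &amp; Malm&#246;"))  # Göteborg & Malmö
```

The substitution makes a single pass over the string, so a doubly escaped reference such as `&amp;#246;` is only unescaped by one level. In Python 3 the standard library also provides `html.unescape`, which covers the same ground.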

</F>

Jan 24 '06 #2
On Tue, 24 Jan 2006 14:46:46 +0100, Fredrik Lundh wrote:
> Robin Haswell wrote:
>> I'm currently screenscraping some Swedish site, and I need a method to
>> convert XML entities (&amp; etc, plus &#100; etc) to Unicode characters.
>> I'm sure one of Python's myriad of XML processors can do this but I can't
>> find which one.
>>
>> Can anyone make any suggestions?
>
> any decent html-aware screen scraper library should be able to do
> this for you.


I'm using BeautifulSoup and it appears that it doesn't. I'd also like to
know the answer to this for when I do screenscraping with regular
expressions :-)

Thanks

> if you've already extracted the strings, the strip_html function on
> this page might be what you need:
>
> http://effbot.org/zone/re-sub.htm#strip-html
>
> </F>


Jan 24 '06 #3
Robin Haswell wrote:
> On Tue, 24 Jan 2006 14:46:46 +0100, Fredrik Lundh wrote:
>> Robin Haswell wrote:
>>> I'm currently screenscraping some Swedish site, and I need a method to
>>> convert XML entities (&amp; etc, plus &#100; etc) to Unicode characters.
>>> I'm sure one of Python's myriad of XML processors can do this but I can't
>>> find which one.
>>>
>>> Can anyone make any suggestions?
>>
>> any decent html-aware screen scraper library should be able to do
>> this for you.


And if it's really XHTML/XML, why not just use an XML parser? ;-)
> I'm using BeautifulSoup and it appears that it doesn't. I'd also like to
> know the answer to this for when I do screenscraping with regular
> expressions :-)


Anyway, on the subject of XML parsers, here's something to try out:

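# Python 2: urllib.urlopen became urllib.request.urlopen in Python 3
import libxml2dom
import urllib
f = urllib.urlopen("http://www.sweden.se/") # some Swedish site!
s = f.read()  # raw contents of the page
f.close()
d = libxml2dom.parseString(s, html=1)  # html=1: treat the input as HTML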

Here, we assume that the site isn't well-formed XML and must be treated
as HTML, which libxml2 seems to be fairly good at doing. Then...

for a in d.xpath("//a"):
    print repr(a.getAttribute("href")), \
          repr(a.getAttribute("title")), \
          repr(a.nodeValue)

Here, we print out some of the hyperlinks in the page using repr to
show what the strings look like (and in a way that doesn't require you
to encode them for your terminal). On the above Swedish site, you'll
see some things like this:

u'Fran\xe7ais'

What's interesting is that in some cases such strings may have been
encoded using entities (such as in the title attributes), whereas in
other cases they may have been encoded using UTF-8 byte sequences (such
as in the link texts). The nice thing is that libxml2 just works it out
on your behalf.
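
The same effect can be sketched with only the standard library. This assumes Python 3 and `html.parser` rather than libxml2dom, so it is an illustration of the idea and not the code above. The `LinkCollector` class and the sample markup are hypothetical. The title attribute uses an entity and the link text uses a UTF-8 byte sequence, and both come out as the same Unicode string.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href, title and text of each <a> element (illustrative only)."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode references in text;
        # attribute values are always delivered with references decoded.
        super().__init__(convert_charrefs=True)
        self.links = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = dict(attrs)
            self._current = {
                "href": attrs.get("href"),
                "title": attrs.get("title"),
                "text": "",
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.links.append(self._current)
            self._current = None


# Made-up sample page: an entity in the attribute, raw UTF-8 bytes in the text.
page_bytes = b'<a href="/fr/" title="Fran&ccedil;ais">Fran\xc3\xa7ais</a>'

collector = LinkCollector()
collector.feed(page_bytes.decode("utf-8"))  # decode the bytes before parsing
collector.close()

for link in collector.links:
    print(repr(link["href"]), repr(link["title"]), repr(link["text"]))
```

Unlike libxml2, this parser does not detect the page encoding, so the bytes must be decoded explicitly before they are fed in.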

So there's no compelling need for regular expressions, but I'm sure
Fredrik will offer some alternative suggestions... and possibly some
good Swedish links, too. ;-)

Paul

Jan 24 '06 #4
