xHTML/XML to Unicode (and back)

Hey guys

I'm currently screenscraping some Swedish site, and I need a method to
convert XML entities (&amp; etc, plus &#100; etc) to Unicode characters.
I'm sure one of Python's myriad of XML processors can do this, but I can't
find which one.

Can anyone make any suggestions?

Thanks

-Rob
Jan 24 '06 #1
Robin Haswell wrote:
> I'm currently screenscraping some Swedish site, and i need a method to
> convert XML entities (&amp; etc, plus &#100; etc) to Unicode characters.
> I'm sure one of python's myriad of XML processors can do this but I can't
> find which one.
>
> Can anyone make any suggestions?


any decent html-aware screen scraper library should be able to do
this for you.

if you've already extracted the strings, the strip_html function on
this page might be what you need:

http://effbot.org/zone/re-sub.htm#strip-html

</F>
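
For strings that have already been extracted, a minimal sketch of both directions, assuming Python 3 and only its standard library. This is not the strip_html function from the page linked above, and decode_entities / encode_entities are hypothetical names chosen for illustration:

```python
import html


def decode_entities(text):
    """Turn named and numeric character references into Unicode characters."""
    return html.unescape(text)


def encode_entities(text):
    """Reverse direction: escape markup characters, then replace every
    non-ASCII character with a numeric character reference."""
    escaped = html.escape(text, quote=False)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(decode_entities("Fran&ccedil;ais &amp; sm&#246;rg&#229;s"))
    print(encode_entities("Fran\xe7ais & co"))
```

html.unescape handles named references (&amp;), decimal references (&#100;) and hexadecimal ones (&#x64;) alike, so no separate cases are needed for already-extracted text.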

Jan 24 '06 #2
On Tue, 24 Jan 2006 14:46:46 +0100, Fredrik Lundh wrote:
> Robin Haswell wrote:
>> I'm currently screenscraping some Swedish site, and i need a method to
>> convert XML entities (&amp; etc, plus &#100; etc) to Unicode characters.
>> I'm sure one of python's myriad of XML processors can do this but I can't
>> find which one.
>>
>> Can anyone make any suggestions?
>
> any decent html-aware screen scraper library should be able to do
> this for you.


I'm using BeautifulSoup and it appears that it doesn't. I'd also like to
know the answer to this for when I do screenscraping with regular
expressions :-)

Thanks

> if you've already extracted the strings, the strip_html function on
> this page might be what you need:
>
> http://effbot.org/zone/re-sub.htm#strip-html
>
> </F>
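
On the regular-expression side of the question, here is a rough sketch, again assuming Python 3. It substitutes each character reference through a callback and leaves anything it does not recognise untouched. The function name unescape_refs and the pattern are illustrative assumptions, not code from this thread:

```python
import re
from html.entities import name2codepoint

# Matches &name;, &#123; and &#x7b; style references.
_REFERENCE = re.compile(r"&(#[xX]?[0-9a-fA-F]+|\w+);")


def unescape_refs(text):
    """Replace character references in text with the characters they name."""

    def fix(match):
        body = match.group(1)
        try:
            if body[:2].lower() == "#x":
                return chr(int(body[2:], 16))    # hexadecimal reference
            if body.startswith("#"):
                return chr(int(body[1:]))        # decimal reference
            return chr(name2codepoint[body])     # named entity
        except (KeyError, ValueError, OverflowError):
            return match.group(0)                # unknown: leave as-is

    return _REFERENCE.sub(fix, text)


if __name__ == "__main__":
    print(unescape_refs("sm&ouml;rg&#229;s &amp; kn&#xe4;ckebr&ouml;d"))
```

The substitution is a single pass, so doubly escaped input such as &amp;lt; comes out as &lt; rather than being decoded twice.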


Jan 24 '06 #3
Robin Haswell wrote:
> On Tue, 24 Jan 2006 14:46:46 +0100, Fredrik Lundh wrote:
>> Robin Haswell wrote:
>>> I'm currently screenscraping some Swedish site, and i need a method to
>>> convert XML entities (&amp; etc, plus &#100; etc) to Unicode characters.
>>> I'm sure one of python's myriad of XML processors can do this but I can't
>>> find which one.
>>>
>>> Can anyone make any suggestions?
>>
>> any decent html-aware screen scraper library should be able to do
>> this for you.


And if it's really XHTML/XML, why not just use an XML parser? ;-)

> I'm using BeautifulSoup and it appears that it doesn't. I'd also like to
> know the answer to this for when I do screenscraping with regular
> expressions :-)


Anyway, on the subject of XML parsers, here's something to try out:

import libxml2dom
import urllib
f = urllib.urlopen("http://www.sweden.se/") # some Swedish site!
s = f.read()
f.close()
d = libxml2dom.parseString(s, html=1)

Here, we assume that the site isn't well-formed XML and must be treated
as HTML, which libxml2 seems to be fairly good at doing. Then...

for a in d.xpath("//a"):
    print repr(a.getAttribute("href")), \
          repr(a.getAttribute("title")), \
          repr(a.nodeValue)

Here, we print out some of the hyperlinks in the page using repr to
show what the strings look like (and in a way that doesn't require you
to encode them for your terminal). On the above Swedish site, you'll
see some things like this:

u'Fran\xe7ais'

What's interesting is that in some cases such strings may have been
encoded using entities (such as in the title attributes), whereas in
other cases they may have been encoded using UTF-8 byte sequences (such
as in the link texts). The nice thing is that libxml2 just works it out
on your behalf.
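
The same effect can be reproduced without libxml2dom. The sketch below assumes Python 3 and uses the standard library's html.parser instead of libxml2 (a swapped-in parser, so its behaviour on badly broken markup may differ). LinkCollector, collect_links and the sample markup are made up for illustration:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href, title and text of every <a> element."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode references in text.
        super().__init__(convert_charrefs=True)
        self.links = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = dict(attrs)  # attribute values arrive already unescaped
            self._current = {"href": attrs.get("href"),
                             "title": attrs.get("title"),
                             "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.links.append(self._current)
            self._current = None


def collect_links(markup):
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


if __name__ == "__main__":
    # Title uses an entity, link text uses UTF-8 bytes, as in the post above.
    raw = '<a href="/fr/" title="Fran&ccedil;ais">Fran\xe7ais</a>'.encode("utf-8")
    for link in collect_links(raw.decode("utf-8")):
        print(repr(link["href"]), repr(link["title"]), repr(link["text"]))
```

Both the title attribute and the link text come back as the same Unicode string, whichever way the page encoded them. The one step left to the caller here is decoding the bytes with the right charset before feeding the parser.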

So there's no compelling need for regular expressions, but I'm sure
Fredrik will offer some alternative suggestions... and possibly some
good Swedish links, too. ;-)

Paul

Jan 24 '06 #4
