xHTML/XML to Unicode (and back)

Robin Haswell

Hey guys

I'm currently screen-scraping a Swedish site, and I need a method to
convert XML entities (&amp; etc, plus &#100; etc) to Unicode characters.
I'm sure one of Python's myriad XML processors can do this, but I can't
find which one.

Can anyone make any suggestions?

Thanks

-Rob

Jan 24 '06 #1


Fredrik Lundh

Robin Haswell wrote:

> I'm currently screenscraping some Swedish site, and i need a method to
> convert XML entities (&amp; etc, plus &#100; etc) to Unicode characters.
> I'm sure one of python's myriad of XML processors can do this but I can't
> find which one.
>
> Can anyone make any suggestions?

any decent html-aware screen scraper library should be able to do
this for you.

if you've already extracted the strings, the strip_html function on
this page might be what you need:

http://effbot.org/zone/re-sub.htm#strip-html

</F>
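For strings that have already been extracted, here is a minimal sketch of both directions using only the standard library of a current Python 3 (the thread itself predates it). This is not the function from the effbot page; the names `entities_to_unicode` and `unicode_to_entities` are made up for illustration.

```python
import html


def entities_to_unicode(text):
    """Decode named and numeric character references into Unicode text."""
    return html.unescape(text)


def unicode_to_entities(text):
    """Escape &, < and >, then turn non-ASCII characters into numeric references."""
    escaped = html.escape(text, quote=False)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


# A named reference, a bare ampersand reference and a numeric reference.
decoded = entities_to_unicode("Fran&ccedil;ais &amp; K&#246;ttbullar")
encoded = unicode_to_entities(decoded)
print(repr(decoded))
print(encoded)
```

The way back always produces numeric references, so the round trip restores the characters but not necessarily the original spelling of each reference.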

Jan 24 '06 #2

Robin Haswell

On Tue, 24 Jan 2006 14:46:46 +0100, Fredrik Lundh wrote:

> Robin Haswell wrote:
>> I'm currently screenscraping some Swedish site, and i need a method to
>> convert XML entities (&amp; etc, plus &#100; etc) to Unicode characters.
>> I'm sure one of python's myriad of XML processors can do this but I can't
>> find which one.
>>
>> Can anyone make any suggestions?
>
> any decent html-aware screen scraper library should be able to do
> this for you.

I'm using BeautifulSoup and it appears that it doesn't. I'd also like to
know the answer to this for when I do screenscraping with regular
expressions :-)

Thanks
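For the regular-expression case, one possible sketch is a single `re.sub` pass with a callback that looks each reference up. It assumes a current Python 3 and the standard `html.entities.name2codepoint` table; `decode_references` and `_replace` are hypothetical names, not anything proposed in the thread.

```python
import re
from html.entities import name2codepoint

# Matches &name;, &#123; and &#x7b; style references.
_REFERENCE = re.compile(r"&(#[xX]?[0-9a-fA-F]+|\w+);")


def _replace(match):
    body = match.group(1)
    try:
        if body[:2] in ("#x", "#X"):
            return chr(int(body[2:], 16))   # hexadecimal reference
        if body.startswith("#"):
            return chr(int(body[1:]))       # decimal reference
        return chr(name2codepoint[body])    # named entity
    except (KeyError, ValueError, OverflowError):
        return match.group(0)  # unknown or malformed: leave untouched


def decode_references(text):
    """Replace character references in text with the characters they name."""
    return _REFERENCE.sub(_replace, text)


print(repr(decode_references("sm&ouml;rg&aring;sbord &amp; &#100;")))
```

Because the substitution is a single pass, text such as `&amp;lt;` decodes to `&lt;` and is not decoded a second time.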

> if you've already extracted the strings, the strip_html function on
> this page might be what you need:
>
> http://effbot.org/zone/re-sub.htm#strip-html
>
> </F>

Jan 24 '06 #3

Paul Boddie

Robin Haswell wrote:

> On Tue, 24 Jan 2006 14:46:46 +0100, Fredrik Lundh wrote:
>> Robin Haswell wrote:
>>> I'm currently screenscraping some Swedish site, and i need a method to
>>> convert XML entities (&amp; etc, plus &#100; etc) to Unicode characters.
>>> I'm sure one of python's myriad of XML processors can do this but I can't
>>> find which one.
>>>
>>> Can anyone make any suggestions?
>>
>> any decent html-aware screen scraper library should be able to do
>> this for you.

And if it's really XHTML/XML, why not just use an XML parser? ;-)

> I'm using BeautifulSoup and it appears that it doesn't. I'd also like to
> know the answer to this for when I do screenscraping with regular
> expressions :-)

Anyway, on the subject of XML parsers, here's something to try out:

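# Python 2 code: urllib.urlopen moved to urllib.request.urlopen in Python 3.
import libxml2dom
import urllib
f = urllib.urlopen("http://www.sweden.se/") # some Swedish site!
s = f.read()   # raw bytes of the page
f.close()
d = libxml2dom.parseString(s, html=1)   # html=1: parse as HTML, not XML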

Here, we assume that the site isn't well-formed XML and must be treated
as HTML, which libxml2 seems to be fairly good at doing. Then...

for a in d.xpath("//a"):
    print repr(a.getAttribute("href")), \
          repr(a.getAttribute("title")), \
          repr(a.nodeValue)

Here, we print out some of the hyperlinks in the page using repr to
show what the strings look like (and in a way that doesn't require you
to encode them for your terminal). On the above Swedish site, you'll
see some things like this:

u'Fran\xe7ais'

What's interesting is that in some cases such strings may have been
encoded using entities (such as in the title attributes), whereas in
other cases they may have been encoded using UTF-8 byte sequences (such
as in the link texts). The nice thing is that libxml2 just works it out
on your behalf.
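The same effect can be seen without libxml2. The sketch below uses the standard library's `xml.etree.ElementTree` in a current Python 3 on a small made-up document (not the real site): the title attribute carries a numeric character reference, the link text carries raw UTF-8 bytes, and both come out as the same Unicode string. It only works on well-formed XML; HTML-only named entities such as `&ccedil;` are not predefined in XML and would make this parser raise an error.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the attribute uses a numeric character reference,
# the element text uses the UTF-8 byte sequence for the same character.
document = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<p><a href="/fr/" title="Fran&#231;ais">Fran\xc3\xa7ais</a></p>'
)

link = ET.fromstring(document).find("a")
title = link.get("title")   # decoded from the character reference
text = link.text            # decoded from the UTF-8 bytes
print(repr(title), repr(text), title == text)
```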

So there's no compelling need for regular expressions, but I'm sure
Fredrik will offer some alternative suggestions... and possibly some
good Swedish links, too. ;-)

Paul

Jan 24 '06 #4
