Python HTML parser chokes on UTF-8 input

Johannes Bauer

Hello group,

I'm trying to use a class derived from htmllib.HTMLParser to parse a website
which I fetched via
httplib.HTTPConnection().request().getresponse().read(). Now the problem
is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The
code is something like this:

prs = self.parserclass(formatter.NullFormatter())
prs.init()
prs.feed(website)
self.__result = prs.get()
prs.close()

Now when I take "website" directly from the parser, everything is fine.
However I want to do some modifications before I parse it, namely UTF-8
modifications in the style:

website = website.replace(u"föö", u"bär")

Therefore, after fetching the web site content, I have to convert it to
UTF-8 first, modify it and convert it back:

website = website.decode("latin1")
website = website.replace(u"föö", u"bär")
website = website.encode("latin1")

This is incredibly ugly IMHO, as I would really like the parser to just
accept UTF-8 input. However when I omit the re-encoding to latin1:

  File "CachedWebParser.py", line 13, in __init__
    self.__process(website)
  File "CachedWebParser.py", line 55, in __process
    prs.feed(website)
  File "/usr/lib64/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib64/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib64/python2.5/sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0: ordinal not in range(128)

Annoying, IMHO, that the internal html Parser cannot cope with UTF-8
input - which should (again, IMHO) be the absolute standard for such a
new language.

Can I do something about it?

Regards,
Johannes

--
"My countersuit against you will then allege deliberate mendacity,
vilification of God, the Bible and me, and deliberate blasphemy."
-- Prophet and visionary Hans Joss aka HJP in de.sci.physik
<48**********************@news.sunrise.ch>

Oct 9 '08 #1

Terry Reedy

Johannes Bauer wrote:

> Hello group,
>
> I'm trying to use a class derived from htmllib.HTMLParser to parse a website
> which I fetched via
> httplib.HTTPConnection().request().getresponse().read(). Now the problem
> is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The
> code is something like this:

I believe you are confusing unicode with unicode encoded into bytes with
the UTF-8 encoding. You are having a problem feeding a unicode string, not
'UTF-8 code', which in Python can only mean a UTF-8 encoded byte string.

> prs = self.parserclass(formatter.NullFormatter())
> prs.init()
> prs.feed(website)
> self.__result = prs.get()
> prs.close()
>
> Now when I take "website" directly from the parser, everything is fine.
> However I want to do some modifications before I parse it, namely UTF-8
> modifications in the style:
>
> website = website.replace(u"föö", u"bär")
>
> Therefore, after fetching the web site content, I have to convert it to
> UTF-8 first, modify it and convert it back:
>
> website = website.decode("latin1")  # produces unicode
> website = website.replace(u"föö", u"bär")  # remains unicode
> website = website.encode("latin1")  # produces byte string in the latin-1 encoding
>
> This is incredibly ugly IMHO, as I would really like the parser to just
> accept UTF-8 input.

To me, code that works is prettier than code that does not.

In 3.0, text strings are unicode, and I believe that is what the parser
now accepts.
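
A minimal sketch of what that looks like on Python 3, assuming the `html.parser` module from the 3.x standard library (htmllib itself was removed in 3.0). `TextCollector` and `extract_text` are hypothetical names used only for illustration; the point is that the bytes are decoded once at the boundary and the parser only ever sees text.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects the text nodes of a document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # data arrives as str (unicode text), never as encoded bytes
        self.chunks.append(data)


def extract_text(raw_bytes, encoding):
    # Decode once, right after fetching; everything after this is text.
    text = raw_bytes.decode(encoding)
    parser = TextCollector()
    parser.feed(text)
    parser.close()
    return "".join(parser.chunks)


# "bär baz" as it might come off the wire, here encoded as UTF-8
page = "<p>b\u00e4r baz</p>".encode("utf-8")
result = extract_text(page, "utf-8")
```

The caller still has to know the encoding of the fetched bytes; Python 3 does not remove that requirement, it only makes the bytes/text boundary explicit.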

> However when I omit the re-encoding to latin1:
>
>   File "CachedWebParser.py", line 13, in __init__
>     self.__process(website)
>   File "CachedWebParser.py", line 55, in __process
>     prs.feed(website)
>   File "/usr/lib64/python2.5/sgmllib.py", line 99, in feed
>     self.goahead(0)
>   File "/usr/lib64/python2.5/sgmllib.py", line 133, in goahead
>     k = self.parse_starttag(i)
>   File "/usr/lib64/python2.5/sgmllib.py", line 285, in parse_starttag
>     self._convert_ref, attrvalue)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0: ordinal not in range(128)

When you do not bother to specify some other encoding in an encoding
operation, sgmllib or something deeper in Python tries the default
encoding, which does not work. Stop being annoyed and tell the
interpreter what you want. It is not a mind-reader.
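
The failing step can be reproduced in isolation. In Python 2 the ASCII decode happens implicitly when a byte string meets a unicode string; the sketch below spells that decode out explicitly so it behaves the same on Python 3. The sample bytes are an assumption chosen to match the 0xfc in the traceback.

```python
raw = b"\xfcber"  # "über" encoded as Latin-1; 0xfc is outside ASCII

message = None
try:
    raw.decode("ascii")  # what the default codec attempts
except UnicodeDecodeError as exc:
    message = str(exc)

# Naming the real encoding succeeds.
text = raw.decode("latin-1")
```

Here `message` holds the same complaint as the traceback, and `text` is the four-character string "über".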

> Annoying, IMHO, that the internal html Parser cannot cope with UTF-8
> input - which should (again, IMHO) be the absolute standard for such a
> new language.

The first version of Python came out in 1989, I believe, years before
unicode. One of the features of the new 3.0 version is that it uses
unicode as the standard for text.

Terry Jan Reedy

Oct 9 '08 #2

Johannes Bauer

Terry Reedy wrote:

> Johannes Bauer wrote:
>> Hello group,
>>
>> I'm trying to use a class derived from htmllib.HTMLParser to parse a website
>> which I fetched via
>> httplib.HTTPConnection().request().getresponse().read(). Now the problem
>> is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The
>> code is something like this:
>
> I believe you are confusing unicode with unicode encoded into bytes with
> the UTF-8 encoding. You are having a problem feeding a unicode string, not
> 'UTF-8 code', which in Python can only mean a UTF-8 encoded byte string.

I also believe I am. Could you please elaborate further?

Do I understand correctly when saying that type 'str' has no associated
default encoding, but type 'unicode' does? Does this mean that really
the only way of coping with that stuff is doing what I've been doing?

>> This is incredibly ugly IMHO, as I would really like the parser to just
>> accept UTF-8 input.
>
> To me, code that works is prettier than code that does not.
>
> In 3.0, text strings are unicode, and I believe that is what the parser
> now accepts.

Well, yes, I suppose working code is nicer than non-working code.
However I am sure you will agree that explicit encoding conversions are
cumbersome and error-prone.

>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0:
>> ordinal not in range(128)
>
> When you do not bother to specify some other encoding in an encoding
> operation, sgmllib or something deeper in Python tries the default
> encoding, which does not work. Stop being annoyed and tell the
> interpreter what you want. It is not a mind-reader.

How do I tell the interpreter to parse the strings I pass to it as
unicode? The way I did or is there some better way?

>> Annoying, IMHO, that the internal html Parser cannot cope with UTF-8
>> input - which should (again, IMHO) be the absolute standard for such a
>> new language.
>
> The first version of Python came out in 1989, I believe, years before
> unicode. One of the features of the new 3.0 version is that it uses
> unicode as the standard for text.

Hmmm. I suppose you're right there. Python 3.0 really sounds quite nice.
Do you know approximately when it will be ready?

Regards,
Johannes

Oct 9 '08 #3

Terry Reedy

Johannes Bauer wrote:

> Terry Reedy wrote:
>> Johannes Bauer wrote:
>>> Hello group,
>>>
>>> I'm trying to use a class derived from htmllib.HTMLParser to parse a website
>>> which I fetched via
>>> httplib.HTTPConnection().request().getresponse().read(). Now the problem
>>> is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The
>>> code is something like this:
>>
>> I believe you are confusing unicode with unicode encoded into bytes with
>> the UTF-8 encoding. You are having a problem feeding a unicode string, not
>> 'UTF-8 code', which in Python can only mean a UTF-8 encoded byte string.
>
> I also believe I am. Could you please elaborate further?

I am a unicode neophyte. My source of info is the first 3 or so
chapters of the unicode specification.
http://www.unicode.org/versions/Unicode5.1.0/
I recommend that or other sites for other questions. It took me more
than one reading of the same topics in different texts to pretty well
'get it'.

> Do I understand correctly when saying that type 'str' has no associated
> default encoding, but type 'unicode' does?

I am not sure what you mean. Unicode strings in Python are internally
stored in UCS-2 or UCS-4 format.

> Does this mean that really
> the only way of coping with that stuff is doing what I've been doing?

Having two text types in 2.x was necessary as a transition strategy but
has also been something of a mess. You did it one way. Jerry gave you
an alternative that I could not have explained. Your choice. Or use 3.0.

...

> Hmmm. I suppose you're right there. Python 3.0 really sounds quite nice.
> Do you know approximately when it will be ready?

For my current purposes, it is ready enough. Developers *really* hope
to get 3.0 final out by mid-December. The schedule was pushed back
because a) the outside world has not completely and cleanly switched to
unicode text and b) some people who just started with the release
candidate have found import bugs that earlier testers did not. It still
needs more testing from more different users (hint, hint).

Terry Jan Reedy

Oct 10 '08 #4

Marc 'BlackJack' Rintsch

On Fri, 10 Oct 2008 00:13:36 +0200, Johannes Bauer wrote:

> Terry Reedy wrote:
>> I believe you are confusing unicode with unicode encoded into bytes
>> with the UTF-8 encoding. You are having a problem feeding a unicode
>> string, not 'UTF-8 code', which in Python can only mean a UTF-8 encoded
>> byte string.
>
> I also believe I am. Could you please elaborate further?
>
> Do I understand correctly when saying that type 'str' has no associated
> default encoding, but type 'unicode' does?

`str` doesn't know an encoding. The content could be any byte data
anyway. And `unicode` doesn't know an encoding either, it is unicode
characters. How they are represented internally is not the business of
the programmer. If you want to operate with unicode characters you have to
decode a byte string (`str`) with the appropriate encoding. If you want to
feed `unicode` to something that expects bytes and not unicode characters
you have to encode again.
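
The decode, modify, encode round trip can be wrapped in one small function so the encoding is named in exactly one place. This is only a sketch: `replace_in_encoded` is a hypothetical helper, and the Latin-1 encoding of the sample page is an assumption carried over from the original post.

```python
def replace_in_encoded(data, old, new, encoding):
    """Replace text inside an encoded byte string.

    data is bytes; old and new are text; the result is bytes again,
    in the same encoding as the input.
    """
    text = data.decode(encoding)      # bytes -> unicode characters
    text = text.replace(old, new)     # work on characters
    return text.encode(encoding)      # unicode characters -> bytes


# "föö" -> "bär", written with escapes to keep the source ASCII-only
page = "<b>f\u00f6\u00f6</b>".encode("latin-1")
fixed = replace_in_encoded(page, "f\u00f6\u00f6", "b\u00e4r", "latin-1")
```

The result is a byte string again, so it can be fed to anything that expects encoded input.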

>> This is incredibly ugly IMHO, as I would really like the parser to
>> just accept UTF-8 input.

It accepts UTF-8 input but not `unicode` objects.

> However I am sure you will agree that explicit encoding conversions are
> cumbersome and error-prone.

But implicit conversions are impossible because the interpreter doesn't
know which encoding to use and refuses to guess. Implicit and guessed
conversions are error prone too.
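
A short illustration of why guessing is unsafe: the same two bytes are valid under more than one encoding and decode to different text. The byte values are chosen for the example.

```python
raw = b"\xc3\xbc"

as_utf8 = raw.decode("utf-8")      # one character: u-umlaut
as_latin1 = raw.decode("latin-1")  # two characters: A-tilde, then 1/4 sign

same_text = (as_utf8 == as_latin1)
```

Neither decode raises an error, so nothing in the bytes themselves says which reading was intended.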

Ciao,
Marc 'BlackJack' Rintsch

Oct 10 '08 #5

John Nagle

Johannes Bauer wrote:

> Hello group,
>
> I'm trying to use a class derived from htmllib.HTMLParser to parse a website
> which I fetched via
> httplib.HTTPConnection().request().getresponse().read(). Now the problem
> is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The
> code is something like this:

Try BeautifulSoup. It actually understands how to detect the encoding
of an HTML file (there are three different ways that information can be
expressed), and will shift modes accordingly.

This is an ugly problem. Sometimes, it's necessary to parse part of
the file, discover that the rest of the file has a non-ASCII encoding,
and restart the parse from the beginning. BeautifulSoup has the
machinery for that.
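
To give a rough idea of the kind of detection involved, here is a standard-library sketch. It is not BeautifulSoup's algorithm. The three sources it checks (byte order mark, HTTP Content-Type header, meta tag) and the Latin-1 fallback are assumptions made for illustration, and `guess_encoding` is a hypothetical name.

```python
import codecs
import re

# Byte order marks checked first; only a few common ones are listed.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# charset=... inside a Content-Type header value (text)
_HEADER_RE = re.compile(r"""charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
                        re.IGNORECASE)

# charset=... inside a <meta> tag of the still-undecoded document (bytes)
_META_RE = re.compile(rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
                      re.IGNORECASE)


def guess_encoding(raw, content_type=None, fallback="latin-1"):
    """Return a lower-case encoding name for the raw bytes of an HTML page."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    if content_type:
        match = _HEADER_RE.search(content_type)
        if match:
            return match.group(1).lower()
    # Only a prefix is scanned; the declaration normally sits in <head>.
    match = _META_RE.search(raw[:4096])
    if match:
        return match.group(1).decode("ascii").lower()
    return fallback


from_meta = guess_encoding(b'<html><head><meta charset="utf-8"></head>')
from_header = guess_encoding(b"<p>x</p>",
                             content_type="text/html; charset=ISO-8859-1")
```

Scanning the raw bytes for the declaration before decoding plays the role of the "parse part, then restart" step described above: the document is only decoded once the encoding has been settled.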

John Nagle

Oct 17 '08 #6
