Py 2.5: Bug in sgmllib

Hi,

If I execute the following two lines in Python 2.5 (to feed in a
*unicode* string):

import sgmllib
sgmllib.SGMLParser().feed(u'<a title="te&#223;t"></a>')

I get the exception:

Traceback (most recent call last):
  File "<pyshell#10>", line 1, in <module>
    sgmllib.SGMLParser().feed(u'<a title="te&#223;t"></a>')
  File "C:\Programme\Python25\Lib\sgmllib.py", line 99, in feed
    self.goahead(0)
  File "C:\Programme\Python25\Lib\sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "C:\Programme\Python25\Lib\sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdf in position 0: ordinal not in range(128)

The reason is that the character reference &#223; is converted to the
*byte* string "\xdf" by SGMLParser.convert_codepoint. Concatenating this
byte string with the surrounding unicode string fails, because Python 2
first tries to decode the byte string as ASCII.

Workaround (not thoroughly tested): override convert_codepoint in a
derived class with:

def convert_codepoint(self, codepoint):
    return unichr(codepoint)
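
The failure and the workaround can be reproduced without sgmllib. What follows is a minimal sketch in Python 3 syntax (where sgmllib no longer exists), not the library's code. It assumes that `bytes([223])` is a fair stand-in for the byte string Python 2's `chr(223)` returned, that the explicit `.decode("ascii")` mirrors the implicit conversion Python 2 performed when a byte string met a unicode string, and that Python 3's `chr` plays the role of `unichr`. All variable names are illustrative.

```python
# Python 3 sketch of the mechanism described above (not sgmllib code).

codepoint = 223  # the number in the character reference &#223;

# Base-class behaviour in Python 2.5: a single byte, not a text character.
as_byte = bytes([codepoint])

# Python 2 decoded byte strings as ASCII when mixing them with unicode
# strings. Doing that step explicitly raises the error from the traceback.
try:
    as_byte.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Workaround behaviour: return a text character instead
# (Python 2's unichr corresponds to Python 3's chr).
as_text = chr(codepoint)
attr_value = "te" + as_text + "t"

print(error_message)
print(attr_value)
```

Once the converted reference is text rather than a byte, joining it with the rest of the attribute value needs no implicit decoding, so the error does not arise.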

Is this a bug or is SGMLParser not meant to be used for unicode strings
(it should be documented then)?

Michael
Oct 22 '06 #1
Michael Butscher wrote:

> If I execute the following two lines in Python 2.5 (to feed in a
> *unicode* string):
>
> import sgmllib
> sgmllib.SGMLParser().feed(u'<a title="te&#223;t"></a>')

Source documents are encoded byte streams, not decoded Unicode
sequences. I suggest reading up on how Python's Unicode string
type works, and what a Unicode string represents. It's not the same
thing as a byte string.
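
To illustrate the distinction drawn here, a short sketch in Python 3 syntax, where the two types are called `bytes` and `str` (in Python 2.5 they were `str` and `unicode`). The sample text and the choice of encodings are assumptions made for illustration. The point is only that one text has several byte representations, and that decoding gives the intended text only when the encoding the document actually uses is named.

```python
# One piece of text (a sequence of Unicode code points)...
text = "te\u00dft"  # "teßt"

# ...has different byte representations depending on the encoding.
latin1_bytes = text.encode("latin-1")  # ß becomes one byte
utf8_bytes = text.encode("utf-8")      # ß becomes two bytes

# Decoding round-trips only when the matching encoding is named.
assert latin1_bytes.decode("latin-1") == text
assert utf8_bytes.decode("utf-8") == text

# Decoding with the wrong encoding does not fail here; it quietly
# produces different text.
misread = utf8_bytes.decode("latin-1")

print(len(text), len(latin1_bytes), len(utf8_bytes))
print(repr(misread))
```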

</F>

Oct 22 '06 #2
Michael Butscher wrote:

> Is this a bug or is SGMLParser not meant to be used for unicode strings
> (it should be documented then)?

In a sense, SGML itself is not meant to be used for Unicode. In SGML,
the document character set is subject to the SGML application. So what
specific character a character reference refers to is also subject to
the SGML application.
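
A small sketch of why the meaning of a numeric reference depends on the document character set, in Python 3 syntax. The two 8-bit charsets are arbitrary examples picked for illustration, and treating the number 223 as a single byte in each of them is a simplification of what an SGML application would actually declare.

```python
codepoint = 223  # the number in &#223;

# Interpret the same number under two different 8-bit character sets.
results = {}
for charset in ("latin-1", "cp1251"):
    results[charset] = bytes([codepoint]).decode(charset)
    print(charset, repr(results[charset]))

# Interpreted as a Unicode code point instead, 223 is U+00DF.
unicode_char = chr(codepoint)
print("unicode", repr(unicode_char))
```

The same number yields a Latin letter under one charset and a Cyrillic letter under the other, which is why a parser that does not know the document character set cannot pick one answer on its own.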

This entire issue is already documented; see the discussion of
convert_charref and convert_codepoint in

http://docs.python.org/lib/module-sgmllib.html

Regards,
Martin
Oct 22 '06 #3
