sgmllib problem & proposed fix.

Hi all,

while playing with PBP/mechanize/ClientForm, I ran into a problem with
the way htmllib.HTMLParser was handling encoded tag attributes.

Specifically, the following HTML was not being handled correctly:

<option value="Small (6&quot;)">Small (6)</option>

The 'value' attr was being passed along in its escaped form,
'Small (6&quot;)', not as the correct unescaped value, 'Small (6")'.

It turns out that sgmllib.SGMLParser (on which htmllib.HTMLParser is
based) does not unescape tag attributes. However, HTMLParser.HTMLParser
(the newer, more XHTML-friendly class) does do so.
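
For readers on Python 3, where sgmllib and htmllib no longer exist, the behaviour described for HTMLParser.HTMLParser can be checked against its successor, html.parser.HTMLParser. This is only a sketch of the expected result, not a reproduction of the sgmllib problem; the class name OptionCollector is made up for illustration.

```python
from html.parser import HTMLParser


class OptionCollector(HTMLParser):
    """Hypothetical helper: records the attributes of every <option> tag."""

    def __init__(self):
        super().__init__()
        self.option_attrs = []

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            # attrs is a list of (name, value) pairs; values arrive unescaped.
            self.option_attrs.append(attrs)


collector = OptionCollector()
collector.feed('<option value="Small (6&quot;)">Small (6&quot;)</option>')
collector.close()
print(collector.option_attrs)  # [[('value', 'Small (6")')]]
```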

My proposed fix is to change sgmllib to unescape tag attributes in the
same way that HTMLParser.HTMLParser does. A context diff to sgmllib.py from
Python 2.4 is at the bottom of this message.

I'm posting to this newsgroup before submitting the patch because I'm
not too familiar with these classes and I want to make sure this
behavior is correct.

One question I had was this: as you can see from the code below, a
simple string.replace is done to replace encoded strings with their
unencoded translations. Should handle_entityref be used instead, as
with standard HTML text?
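
To make the trade-off in that question concrete, here is the replacement chain from the patch rewritten as a standalone Python 3 function. The names unescape_basic and unescape_amp_first are illustrative and not part of sgmllib. The sketch shows why "&amp;" has to be handled last, and that numeric references such as "&#34;" are left untouched by a fixed list of replacements. html.unescape from the Python 3 standard library appears only as a point of comparison.

```python
import html


def unescape_basic(s):
    """Same five replacements as the patch; '&amp;' must come last."""
    if "&" not in s:
        return s
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    s = s.replace("&apos;", "'")
    s = s.replace("&quot;", '"')
    s = s.replace("&amp;", "&")  # last, so '&amp;lt;' becomes '&lt;', not '<'
    return s


def unescape_amp_first(s):
    """Deliberately wrong ordering, to show the double-unescape problem."""
    s = s.replace("&amp;", "&")
    s = s.replace("&lt;", "<")
    return s


print(unescape_basic("Small (6&quot;)"))   # Small (6")
print(unescape_basic("&amp;lt;"))          # &lt;
print(unescape_amp_first("&amp;lt;"))      # <
print(unescape_basic("Small (6&#34;)"))    # unchanged: Small (6&#34;)
print(html.unescape("Small (6&#34;)"))     # Small (6")
```

Whether the missing numeric references matter depends on the pages being parsed; the fixed chain at least matches what HTMLParser.HTMLParser does in Python 2.4.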

Another question: should this fix, if appropriate, be back-ported to
older versions of Python? (I doubt sgmllib has changed much, so it
should be pretty simple to do.)

thanks for any advice,
--titus

*** /u/t/software/Python-2.4/Lib/sgmllib.py 2004-09-08 18:49:58.000000000 -0700
--- sgmllib.py 2004-12-16 23:30:51.000000000 -0800
***************
*** 272,277 ****
--- 272,278 ----
              elif attrvalue[:1] == '\'' == attrvalue[-1:] or \
                   attrvalue[:1] == '"' == attrvalue[-1:]:
                  attrvalue = attrvalue[1:-1]
+                 attrvalue = self.unescape(attrvalue)
              attrs.append((attrname.lower(), attrvalue))
              k = match.end(0)
          if rawdata[j] == '>':
***************
*** 414,419 ****
--- 415,432 ----
      def unknown_charref(self, ref): pass
      def unknown_entityref(self, ref): pass

+     # Internal -- helper to remove special character quoting
+     def unescape(self, s):
+         if '&' not in s:
+             return s
+         s = s.replace("&lt;", "<")
+         s = s.replace("&gt;", ">")
+         s = s.replace("&apos;", "'")
+         s = s.replace("&quot;", '"')
+         s = s.replace("&amp;", "&") # Must be last
+
+         return s
+

  class TestSGMLParser(SGMLParser):
Jul 18 '05 #1
Whoops! Forgot an executable example ;).

Attached, and also available at

http://issola.caltech.edu/~t/transfer/test-enc.py
http://issola.caltech.edu/~t/transfer/test-enc.html

Run 'python test-enc.py test-enc.html' and note that
htmllib.HTMLParser-based parsers give different output than
HTMLParser.HTMLParser-based parsers.

cheers,
--titus

#!/usr/bin/env python2.4
import htmllib
import HTMLParser
import formatter

### a simple mix-in to demonstrate the problem.

class MixinTest:
    def start_option(self, attrs):
        print '==> OPTION starting', attrs

    # Definition of entities -- derived classes may override
    entitydefs = \
        {'lt': '<', 'gt': '>', 'amp': '&', 'quot': '"', 'apos': '\''}

    def handle_entityref(self, name):
        print '==> HANDLING ENTITY', name
        table = self.entitydefs
        if name in table:
            self.handle_data(table[name])
        else:
            self.unknown_entityref(name)
            return

####

class htmllib_Parser(MixinTest, htmllib.HTMLParser):
    def __init__(self):
        htmllib.HTMLParser.__init__(self, formatter.NullFormatter())

class nonhtmllib_Parser(MixinTest, HTMLParser.HTMLParser):
    def handle_starttag(self, name, attrs):
        "Redirect OPTION tag ==> MixinTest.start_option"
        if name == 'option':
            self.start_option(attrs)

    pass

###

import sys
data = open(sys.argv[1]).read()

print 'PARSING with htmllib.HTMLParser'

htmllib_p = htmllib_Parser()
htmllib_p.feed(data)

print '\nPARSING with HTMLParser.HTMLParser'

nonhtmllib_p = nonhtmllib_Parser()
nonhtmllib_p.feed(data)
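
The script above needs Python 2, since htmllib was removed in Python 3. As a rough Python 3 analogue of its HTMLParser half, the sketch below records which callbacks fire. It assumes convert_charrefs=False so that handle_entityref is still called for entity references in text; EventRecorder and its events list are illustrative names, and the input string is a guess at what test-enc.html contains. The point it illustrates: references in text go through handle_entityref, while the attribute value is already unescaped by the time handle_starttag sees it.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: logs parser callbacks in the order they fire."""

    def __init__(self):
        # convert_charrefs=False keeps handle_entityref in play for text.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_data(self, data):
        self.events.append(("data", data))


recorder = EventRecorder()
recorder.feed('<option value="Small (6&quot;)">Small (6&quot;)</option>')
recorder.close()
for event in recorder.events:
    print(event)
```

Only one entityref event is recorded, for the reference in the element text; the one inside the attribute never reaches handle_entityref.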

Jul 18 '05 #2
