sgmlop: malformed charrefs?

According to The Sgmlop Module Handbook [1], the handle_entityref()
callback is called for "malformed character entities". What does that
mean, exactly? What is a malformed character entity? I've tried
mis-spelling them (e.g., dropping the semicolon), but then they're
(quite naturally) treated as text/data, with handle_data(). I've tried
to use a number that is too great, or (equivalently, it turns out) to
use names instead of numbers, such as &#foo;. In these cases, I only
get an exception, because the number is too high...

So -- how can I produce a malformed character entity? I've tried to
read the C code, but I can't say that left me any wiser on the
subject; it doesn't seem to have any special-casing for this that I
can find.

And another thing... For the case where a numeric reference is too
high (i.e. it can't be translated into a Unicode character) -- is it
possible to ignore it (or replace it, as with encode/decode)? I'm
trying to write a parser that will accept *any* input text without
complaining -- but simply trapping this exception would seem to
disrupt the parsing process...

Thanks,

- Magnus

--
Magnus Lie Hetland Time flies like the wind. Fruit flies
http://hetland.org like bananas. -- Groucho Marx
Jul 18 '05 #1
Magnus Lie Hetland wrote:
> According to The Sgmlop Module Handbook [1], the handle_entityref()
> callback is called for "malformed character entities". What does that
> mean, exactly? What is a malformed character entity? I've tried
> mis-spelling them (e.g., dropping the semicolon), but then they're
> (quite naturally) treated as text/data, with handle_data(). I've tried
> to use a number that is too great, or (equivalently, it turns out) to
> use names instead of numbers, such as &#foo;. In these cases, I only
> get an exception, because the number is too high...
>
> So -- how can I produce a malformed character entity?

with sgmlop 1.1, the following script

import sgmlop

class entity_handler:
    def handle_entityref(self, entityref):
        print "ENTITY", repr(entityref)

parser = sgmlop.XMLParser()
parser.register(entity_handler())
parser.feed("&-10;&/()=?;")

prints:

ENTITY '-10'
ENTITY '/()=?'

> And another thing... For the case where a numeric reference is too
> high (i.e. it can't be translated into a Unicode character) -- is it
> possible to ignore it (or replace it, as with encode/decode)?


if you don't do anything, it is ignored.

if you specify a handle_charref hook, the part between &# and ; is passed
to that method.

if you have a handle_entityref hook, but no handle_charref, the part between
& and ; is passed to handle_entityref.

</F>

Jul 18 '05 #2
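
The hook precedence described in this reply (for a numeric reference that cannot be converted to a character) can be modelled without sgmlop itself. What follows is only a sketch in modern Python 3, whereas the thread's own code is Python 2. `dispatch_charref` and the three handler classes are hypothetical names, and the function imitates the three stated rules rather than sgmlop's actual C code.

```python
def dispatch_charref(handler, body):
    """Model of the stated rule for a reference written as '&#' + body + ';'.

    This is NOT sgmlop code, just the three cases from the reply:
      1. handle_charref present -> it receives the text between '&#' and ';'
      2. only handle_entityref  -> it receives the text between '&' and ';'
      3. neither hook           -> the reference is ignored
    """
    if hasattr(handler, "handle_charref"):
        handler.handle_charref(body)
        return "charref"
    if hasattr(handler, "handle_entityref"):
        handler.handle_entityref("#" + body)
        return "entityref"
    return "ignored"


class CharrefHandler:
    """Hypothetical handler that defines both hooks."""

    def __init__(self):
        self.seen = []

    def handle_charref(self, ref):
        self.seen.append(("CHARREF", ref))

    def handle_entityref(self, ref):
        self.seen.append(("ENTITY", ref))


class EntityOnlyHandler:
    """Hypothetical handler that defines handle_entityref only."""

    def __init__(self):
        self.seen = []

    def handle_entityref(self, ref):
        self.seen.append(("ENTITY", ref))


class NoHooks:
    """Hypothetical handler with no hooks at all."""
```

Under this model a handler with only `handle_entityref` would see the leading `#` as part of the name, which is one way to tell a fallen-through character reference apart from an ordinary entity reference.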
In article <ma*************************************@python.org>,
Fredrik Lundh wrote:
> Magnus Lie Hetland wrote:
[snip]
> with sgmlop 1.1, the following script
>
> import sgmlop
>
> class entity_handler:
>     def handle_entityref(self, entityref):
>         print "ENTITY", repr(entityref)
>
> parser = sgmlop.XMLParser()
> parser.register(entity_handler())
> parser.feed("&-10;&/()=?;")
>
> prints:
>
> ENTITY '-10'
> ENTITY '/()=?'

OK, thanks. I guess I just wasn't creative enough in my entity naming
:)

>> And another thing... For the case where a numeric reference is too
>> high (i.e. it can't be translated into a Unicode character) -- is it
>> possible to ignore it (or replace it, as with encode/decode)?
>
> if you don't do anything, it is ignored.
>
> if you specify a handle_charref hook, the part between &# and ; is passed
> to that method.

I see -- it's just if the default behaviour of transforming it to text
kicks in that there is trouble? (That makes sense, of course.)

> if you have a handle_entityref hook, but no handle_charref, the part between
> & and ; is passed to handle_entityref.

Strange. It doesn't seem to work that way for me... Here is an example:

.......................................................................
from xml.parsers.sgmlop import SGMLParser, XMLParser, XMLUnicodeParser

class Handler:

    def handle_data(self, data):
        print 'DATA', data

    def handle_entityref(self, data):
        print 'ENTITY', data

for parser in [SGMLParser(), XMLParser(), XMLUnicodeParser()]:
    parser.register(Handler())
    try:
        # a character reference far beyond the valid range
        parser.feed('&#x540be3ff;')
    except Exception, e:
        print e
.......................................................................

When I run this, I get:

character reference &#x540be3ff; exceeds ASCII range
character reference &#x540be3ff; exceeds ASCII range
character reference &#x540be3ff; exceeds sys.maxunicode (0xffff)

If I remove the handle_data, nothing happens.
> </F>


--
Magnus Lie Hetland Time flies like the wind. Fruit flies
http://hetland.org like bananas. -- Groucho Marx
Jul 18 '05 #3
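
If the parser in use does call a handle_charref hook with the text between &# and ; (as the earlier reply describes for the effbot.org sgmlop; the output above suggests the PyXML fork may behave differently), then the "replace instead of raise" behaviour asked about could live in that hook. A minimal Python 3 sketch follows. `charref_to_text` and `LenientHandler` are hypothetical names, and the hook is meant to be called by a parser that supports it, which this snippet does not include.

```python
import sys

# U+FFFD, the character that bytes.decode(errors="replace") substitutes.
REPLACEMENT = "\ufffd"


def charref_to_text(ref):
    """Turn the text between '&#' and ';' into a character, never raising."""
    try:
        # '&#x41;' style references are hexadecimal, '&#65;' style are decimal.
        if ref[:1] in ("x", "X"):
            code = int(ref[1:], 16)
        else:
            code = int(ref, 10)
    except ValueError:
        return REPLACEMENT  # not a number at all, e.g. '&#foo;'
    if code < 0 or code > sys.maxunicode or 0xD800 <= code <= 0xDFFF:
        return REPLACEMENT  # out of range, or a lone surrogate
    return chr(code)


class LenientHandler:
    """Collects text; unusable character references become U+FFFD."""

    def __init__(self):
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_charref(self, ref):
        self.parts.append(charref_to_text(ref))

    def text(self):
        return "".join(self.parts)
```

Returning a replacement character rather than dropping the reference keeps a visible trace of the bad input, much like the "replace" error handler of encode/decode mentioned in the question.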
Magnus Lie Hetland wrote:
>> if you have a handle_entityref hook, but no handle_charref, the part between
>> & and ; is passed to handle_entityref.
>
> Strange. It doesn't seem to work that way for me... Here is an example:
>
> from xml.parsers.sgmlop import SGMLParser, XMLParser, XMLUnicodeParser


are the PyXML folks shipping the latest sgmlop? I'm pretty sure they've
forked the code (there's no UnicodeParser in the effbot.org edition), and
I have no idea how things work in the fork.

</F>

Jul 18 '05 #4
In article <ma*************************************@python.org>,
Fredrik Lundh wrote:
[snip]
> are the PyXML folks shipping the latest sgmlop?

I don't know. The last history entry marked fl is from 2000-07-05...

Perhaps I should just get the effbot version. (And perhaps file a bug
report about this behaviour in PyXML.)

> I'm pretty sure they've forked the code (there's no UnicodeParser in
> the effbot.org edition),

Does it deal with Unicode at all? I.e., can I, for example, feed it a
Unicode object?

> and I have no idea how things work in the fork.


I see.

--
Magnus Lie Hetland Time flies like the wind. Fruit flies
http://hetland.org like bananas. -- Groucho Marx
Jul 18 '05 #5
Fredrik Lundh wrote:
> are the PyXML folks shipping the latest sgmlop? I'm pretty sure they've
> forked the code (there's no UnicodeParser in the effbot.org edition), and
> I have no idea how things work in the fork.


As we've forked the code, the answer is a clear "yes" :-) It certainly
is the latest release of the fork.

Regards,
Martin
Jul 18 '05 #6
Martin v. Löwis wrote:
>> are the PyXML folks shipping the latest sgmlop? I'm pretty sure they've
>> forked the code (there's no UnicodeParser in the effbot.org edition), and
>> I have no idea how things work in the fork.
>
> As we've forked the code, the answer is a clear "yes" :-) It certainly
> is the latest release of the fork.

if the 2000-07-05 date is correct, there have been at least eight public releases
of the original sgmlop distribution since the fork.

</F>

Jul 18 '05 #8
In article <ma*************************************@python.org>,
Fredrik Lundh wrote:
> Martin v. Löwis wrote:
>>> are the PyXML folks shipping the latest sgmlop? I'm pretty sure
>>> they've forked the code (there's no UnicodeParser in the
>>> effbot.org edition), and I have no idea how things work in the
>>> fork.
>>
>> As we've forked the code, the answer is a clear "yes" :-) It
>> certainly is the latest release of the fork.
>
> if the 2000-07-05 date is correct, there have been at least eight
> public releases of the original sgmlop distribution since the fork.


Hm. This may, of course, be just fine -- but it seems a bit
unfortunate to me... I.e. nice features added in each of the two, but
no distribution where all the features are available... Or something.
(Or at least all the bug fixes :)

Is there any chance of at least sharing fixes for things such as the
illegal charrefs becoming entity refs etc.? (Yeah, I know, I can
submit patches, but I don't know the code all that well...)

Or: What are the chances of handling Unicode with the Effbot sgmlop
(which seems to be the only feature I'm missing in that at the
moment)? Using UTF-8 or something would be completely acceptable to
me, as long as it works. (Maybe simply feeding it UTF-8 strings would
work as it is? Except for Unicode charrefs, of course... Or?)

- M

--
Magnus Lie Hetland Time flies like the wind. Fruit flies
http://hetland.org like bananas. -- Groucho Marx
Jul 18 '05 #9
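
The UTF-8 idea floated in this post can at least be tried out in isolation. The sketch below (Python 3, hypothetical names, no sgmlop involved) assumes a byte-oriented parser that hands markup-free byte chunks to handle_data unchanged. Whether the effbot.org sgmlop behaves that way is exactly the open question here. An incremental decoder is used because a chunk boundary may fall inside a multi-byte UTF-8 sequence.

```python
import codecs


class Utf8DataCollector:
    """Decodes UTF-8 byte chunks passed to handle_data, chunk by chunk."""

    def __init__(self):
        # errors="replace" substitutes U+FFFD instead of raising on bad bytes.
        self._decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
        self.parts = []

    def handle_data(self, data):
        # The decoder buffers an incomplete trailing sequence until
        # the next chunk arrives.
        self.parts.append(self._decoder.decode(data))

    def close(self):
        # Flush the decoder; a sequence that is still incomplete at this
        # point is replaced rather than raising.
        self.parts.append(self._decoder.decode(b"", final=True))
        return "".join(self.parts)
```

Character references would still need separate handling, since they name code points directly and never pass through the byte decoder.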
