sgmlop: malformed charrefs?

Magnus Lie Hetland

According to The Sgmlop Module Handbook [1], the handle_entityref()
callback is called for "malformed character entities". What does that
mean, exactly? What is a malformed character entity? I've tried
mis-spelling them (e.g., dropping the semicolon), but then they're
(quite naturally) treated as text/data, with handle_data(). I've tried
to use a number that is too great, or (equivalently, it turns out) to
use names instead of numbers, such as &#foo;. In these cases, I only
get an exception, because the number is too high...

So -- how can I produce a malformed character entity? I've tried to
read the C code, but I can't say that left me any wiser on the
subject; it doesn't seem to have any special-casing for this that I
can find.

And another thing... For the case where a numeric reference is too
high (i.e. it can't be translated into a Unicode character) -- is it
possible to ignore it (or replace it, as with encode/decode)? I'm
trying to write a parser that will accept *any* input text without
complaining -- but simply trapping this exception would seem to
disrupt the parsing process...
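
One way to keep the parser from ever seeing an unconvertible reference is to filter the text before feeding it. This is only a minimal sketch under stated assumptions: `drop_bad_charrefs` and its `replacement` parameter are hypothetical names, not part of sgmlop, and the code is written for Python 3. It uses `sys.maxunicode`, so the limit follows whatever interpreter runs it.

```python
import re
import sys

# Matches decimal (&#123;) and hexadecimal (&#x7b;) character references.
_CHARREF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def drop_bad_charrefs(text, replacement="?"):
    """Replace numeric character references that cannot become a character.

    Hypothetical helper, not part of sgmlop.
    """
    def fix(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Too large for this interpreter, or a lone surrogate code point.
        if code > sys.maxunicode or 0xD800 <= code <= 0xDFFF:
            return replacement
        # Valid references are left for the parser to handle.
        return match.group(0)

    return _CHARREF.sub(fix, text)
```

References such as &#foo; are not numeric, so the pattern does not match them and they pass through unchanged.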

Thanks,

- Magnus

--
Magnus Lie Hetland       Time flies like the wind. Fruit flies
http://hetland.org       like bananas. -- Groucho Marx

Jul 18 '05 #1

Fredrik Lundh

Magnus Lie Hetland wrote:

> According to The Sgmlop Module Handbook [1], the handle_entityref()
> callback is called for "malformed character entities". What does that
> mean, exactly? What is a malformed character entity? I've tried
> mis-spelling them (e.g., dropping the semicolon), but then they're
> (quite naturally) treated as text/data, with handle_data(). I've tried
> to use a number that is too great, or (equivalently, it turns out) to
> use names instead of numbers, such as &#foo;. In these cases, I only
> get an exception, because the number is too high...
>
> So -- how can I produce a malformed character entity?

with sgmlop 1.1, the following script

import sgmlop

class entity_handler:
    def handle_entityref(self, entityref):
        print "ENTITY", repr(entityref)

parser = sgmlop.XMLParser()
parser.register(entity_handler())
parser.feed("&-10;&/()=?;")

prints:

ENTITY '-10'
ENTITY '/()=?'

> And another thing... For the case where a numeric reference is too
> high (i.e. it can't be translated into a Unicode character) -- is it
> possible to ignore it (or replace it, as with encode/decode)?

if you don't do anything, it is ignored.

if you specify a handle_charref hook, the part between &# and ; is passed
to that method.

if you have a handle_entityref hook, but no handle_charref, the part between
& and ; is passed to handle_entityref.
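
The handle_charref hook described above could be used to substitute a replacement character instead of failing. A minimal sketch, assuming the hook receives the text between &# and ; as stated; `TolerantHandler` is a hypothetical name, the code is written for Python 3 (Python 2 would need unichr instead of chr), and it has not been checked against any particular sgmlop release:

```python
import sys


class TolerantHandler:
    """Collects text and never raises on a bad character reference."""

    def __init__(self):
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_charref(self, ref):
        # `ref` is assumed to be the part between "&#" and ";",
        # for example "65", "x41" or "foo".
        try:
            if ref[:1] in ("x", "X"):
                code = int(ref[1:], 16)
            else:
                code = int(ref)
            if code < 0 or code > sys.maxunicode:
                raise ValueError(ref)
            char = chr(code)
        except ValueError:
            char = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER
        self.handle_data(char)
```

An instance would be passed to parser.register() in the same way as entity_handler() in the script above.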

</F>

Jul 18 '05 #2

Magnus Lie Hetland

In article <ma*************************************@python.org>,
Fredrik Lundh wrote:

> Magnus Lie Hetland wrote:
[snip]
> with sgmlop 1.1, the following script
>
> class entity_handler:
>     def handle_entityref(self, entityref):
>         print "ENTITY", repr(entityref)
>
> parser = sgmlop.XMLParser()
> parser.register(entity_handler())
> parser.feed("&-10;&/()=?;")
>
> prints:
>
> ENTITY '-10'
> ENTITY '/()=?'

OK, thanks. I guess I just wasn't creative enough in my entity naming
:)

>> And another thing... For the case where a numeric reference is too
>> high (i.e. it can't be translated into a Unicode character) -- is it
>> possible to ignore it (or replace it, as with encode/decode)?
>
> if you don't do anything, it is ignored.
>
> if you specify a handle_charref hook, the part between &# and ; is passed
> to that method.

I see -- it's just if the default behaviour of transforming it to text
kicks in that there is trouble? (That makes sense, of course.)

> if you have a handle_entityref hook, but no handle_charref, the part between
> & and ; is passed to handle_entityref.

Strange. It doesn't seem to work that way for me... Here is an example:

.......................................................................
from xml.parsers.sgmlop import SGMLParser, XMLParser, XMLUnicodeParser

class Handler:

    def handle_data(self, data):
        print 'DATA', data

    def handle_entityref(self, data):
        print 'ENTITY', data

for parser in [SGMLParser(), XMLParser(), XMLUnicodeParser()]:
    parser.register(Handler())
    try:
        parser.feed('&#x540be3ff;')
    except Exception, e:
        print e
.......................................................................

When I run this, I get:

character reference &#x540be3ff; exceeds ASCII range
character reference &#x540be3ff; exceeds ASCII range
character reference &#x540be3ff; exceeds sys.maxunicode (0xffff)

If I remove the handle_data, nothing happens.

> </F>

Jul 18 '05 #3

Fredrik Lundh

Magnus Lie Hetland wrote:

>> if you have a handle_entityref hook, but no handle_charref, the part between
>> & and ; is passed to handle_entityref.
>
> Strange. It doesn't seem to work that way for me... Here is an example:
>
> from xml.parsers.sgmlop import SGMLParser, XMLParser, XMLUnicodeParser

are the PyXML folks shipping the latest sgmlop? I'm pretty sure they've
forked the code (there's no UnicodeParser in the effbot.org edition), and
I have no idea how things work in the fork.

</F>

Jul 18 '05 #4

Magnus Lie Hetland

In article <ma*************************************@python.org>,
Fredrik Lundh wrote:
[snip]

> are the PyXML folks shipping the latest sgmlop?

I don't know. The last history entry marked fl is from 2000-07-05...

Perhaps I should just get the effbot version. (And perhaps file a bug
report about this behaviour in PyXML.)

> I'm pretty sure they've forked the code (there's no UnicodeParser in
> the effbot.org edition),

Does it deal with Unicode at all? I.e., can I, for example, feed it a
Unicode object?

> and I have no idea how things work in the fork.

I see.

Jul 18 '05 #5

Martin v. Löwis

Fredrik Lundh wrote:

> are the PyXML folks shipping the latest sgmlop? I'm pretty sure they've
> forked the code (there's no UnicodeParser in the effbot.org edition), and
> I have no idea how things work in the fork.

As we've forked the code, the answer is a clear "yes" :-) It certainly
is the latest release of the fork.

Regards,
Martin

Jul 18 '05 #6

Fredrik Lundh

Martin v. Löwis wrote:

>> are the PyXML folks shipping the latest sgmlop? I'm pretty sure they've
>> forked the code (there's no UnicodeParser in the effbot.org edition), and
>> I have no idea how things work in the fork.
>
> As we've forked the code, the answer is a clear "yes" :-) It certainly
> is the latest release of the fork.

if the 2000-07-05 date is correct, there have been at least eight public releases
of the original sgmlop distribution since the fork.

</F>

Jul 18 '05 #8

Magnus Lie Hetland

In article <ma*************************************@python.org>,
Fredrik Lundh wrote:

> Martin v. Löwis wrote:
>>> are the PyXML folks shipping the latest sgmlop? I'm pretty sure
>>> they've forked the code (there's no UnicodeParser in the
>>> effbot.org edition), and I have no idea how things work in the
>>> fork.
>>
>> As we've forked the code, the answer is a clear "yes" :-) It
>> certainly is the latest release of the fork.
>
> if the 2000-07-05 date is correct, there have been at least eight
> public releases of the original sgmlop distribution since the fork.

Hm. This may, of course, be just fine -- but it seems a bit
unfortunate to me... I.e. nice features added in each of the two, but
no distribution where all the features are available... Or something.
(Or at least all the bug fixes :)

Is there any chance of at least sharing fixes for things such as the
illegal charrefs becoming entity refs etc.? (Yeah, I know, I can
submit patches, but I don't know the code all that well...)

Or: What are the chances of handling Unicode with the Effbot sgmlop
(which seems to be the only feature I'm missing in that at the
moment)? Using UTF-8 or something would be completely acceptable to
me, as long as it works. (Maybe simply feeding it UTF-8 strings would
work as it is? Except for Unicode charrefs, of course... Or?)
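
The UTF-8 idea could look roughly like the following on the handler side. This is only a sketch, assuming the parser hands byte strings to handle_data unchanged, which nothing in this thread confirms; `Utf8Collector` and `finish` are hypothetical names, and the code is written for Python 3. An incremental decoder is used because a chunk boundary may fall in the middle of a multi-byte character.

```python
import codecs


class Utf8Collector:
    """Decodes UTF-8 byte chunks passed to handle_data into text."""

    def __init__(self):
        # errors="replace" turns invalid bytes into U+FFFD instead of raising.
        self._decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
        self._parts = []

    def handle_data(self, data):
        # `data` is assumed to be a UTF-8 byte string, possibly cut mid-character.
        self._parts.append(self._decoder.decode(data))

    def finish(self):
        # Flush any incomplete trailing sequence and return the collected text.
        self._parts.append(self._decoder.decode(b"", final=True))
        return "".join(self._parts)
```

The caller would encode the input with text.encode("utf-8") before feeding it. Numeric character references would still need separate handling, as noted above.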

- M

Jul 18 '05 #9
