sgmlop: malformed charrefs?

According to The Sgmlop Module Handbook [1], the handle_entityref()
callback is called for "malformed character entities". What does that
mean, exactly? What is a malformed character entity? I've tried
mis-spelling them (e.g., dropping the semicolon), but then they're
(quite naturally) treated as text/data, with handle_data(). I've tried
to use a number that is too great, or (equivalently, it turns out) to
use names instead of numbers, such as &#foo;. In these cases, I only
get an exception, because the number is too high...

So -- how can I produce a malformed character entity? I've tried to
read the C code, but I can't say that left me any wiser on the
subject; it doesn't seem to have any special-casing for this that I
can find.

And another thing... For the case where a numeric reference is too
high (i.e. it can't be translated into a Unicode character) -- is it
possible to ignore it (or replace it, as with encode/decode)? I'm
trying to write a parser that will accept *any* input text without
complaining -- but simply trapping this exception would seem to
disrupt the parsing process...

Thanks,

- Magnus

--
Magnus Lie Hetland     Time flies like the wind. Fruit flies
http://hetland.org     like bananas. -- Groucho Marx
Jul 18 '05 #1
8 Replies


Magnus Lie Hetland wrote:
> According to The Sgmlop Module Handbook [1], the handle_entityref()
> callback is called for "malformed character entities". What does that
> mean, exactly? What is a malformed character entity? I've tried
> mis-spelling them (e.g., dropping the semicolon), but then they're
> (quite naturally) treated as text/data, with handle_data(). I've tried
> to use a number that is too great, or (equivalently, it turns out) to
> use names instead of numbers, such as &#foo;. In these cases, I only
> get an exception, because the number is too high...
>
> So -- how can I produce a malformed character entity?

with sgmlop 1.1, the following script

    import sgmlop

    class entity_handler:
        # receives the text between & and ;
        def handle_entityref(self, entityref):
            print "ENTITY", repr(entityref)

    parser = sgmlop.XMLParser()
    parser.register(entity_handler())
    parser.feed("&-10;&/()=?;")

prints:

    ENTITY '-10'
    ENTITY '/()=?'

> And another thing... For the case where a numeric reference is too
> high (i.e. it can't be translated into a Unicode character) -- is it
> possible to ignore it (or replace it, as with encode/decode)?

if you don't do anything, it is ignored.

if you specify a handle_charref hook, the part between &# and ; is passed
to that method.

if you have a handle_entityref hook, but no handle_charref, the part between
& and ; is passed to handle_entityref.

</F>

Jul 18 '05 #2
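
[Editor's sketch] The three rules in the reply above can be modelled in a few lines. This is only a toy model of the behaviour as described in this thread, not sgmlop's actual code: `dispatch_charref` and `Recorder` are hypothetical names, and the sketch is written for a current Python rather than the Python 2 used elsewhere in the thread.

```python
# Toy model of the routing described above for a numeric reference the
# parser cannot translate itself. NOT sgmlop's implementation.

def dispatch_charref(handler, body):
    """Route the reference '&#' + body + ';' and report which hook ran.

    Returns the hook name, or None if the reference was silently ignored.
    """
    if hasattr(handler, "handle_charref"):
        # A charref hook gets the part between '&#' and ';'.
        handler.handle_charref(body)
        return "handle_charref"
    if hasattr(handler, "handle_entityref"):
        # Only an entity hook: it gets the part between '&' and ';'.
        handler.handle_entityref("#" + body)
        return "handle_entityref"
    # No hook at all: the reference is dropped.
    return None


class Recorder:
    """Hypothetical handler with only an entity hook."""

    def __init__(self):
        self.seen = []

    def handle_entityref(self, name):
        self.seen.append(name)


recorder = Recorder()
print(dispatch_charref(recorder, "x540be3ff"), recorder.seen)
print(dispatch_charref(object(), "x540be3ff"))
```

With only `handle_entityref` defined, the model hands `#x540be3ff` to that hook; with no hooks it returns `None`. Whether a given sgmlop build really behaves this way is exactly what the rest of the thread is about.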

In article <ma*************************************@python.org>,
Fredrik Lundh wrote:
> Magnus Lie Hetland wrote:
[snip]
> with sgmlop 1.1, the following script
>
>     class entity_handler:
>         def handle_entityref(self, entityref):
>             print "ENTITY", repr(entityref)
>
>     parser = sgmlop.XMLParser()
>     parser.register(entity_handler())
>     parser.feed("&-10;&/()=?;")
>
> prints:
>
>     ENTITY '-10'
>     ENTITY '/()=?'

OK, thanks. I guess I just wasn't creative enough in my entity naming
:)

>> And another thing... For the case where a numeric reference is too
>> high (i.e. it can't be translated into a Unicode character) -- is it
>> possible to ignore it (or replace it, as with encode/decode)?
>
> if you don't do anything, it is ignored.
>
> if you specify a handle_charref hook, the part between &# and ; is passed
> to that method.

I see -- it's just if the default behaviour of transforming it to text
kicks in that there is trouble? (That makes sense, of course.)

> if you have a handle_entityref hook, but no handle_charref, the part between
> & and ; is passed to handle_entityref.

Strange. It doesn't seem to work that way for me... Here is an example:

.......................................................................
from xml.parsers.sgmlop import SGMLParser, XMLParser, XMLUnicodeParser

class Handler:

    def handle_data(self, data):
        print 'DATA', data

    def handle_entityref(self, data):
        print 'ENTITY', data

# Note: no handle_charref hook is defined, only handle_entityref.
for parser in [SGMLParser(), XMLParser(), XMLUnicodeParser()]:
    parser.register(Handler())
    try:
        parser.feed('&#x540be3ff;')
    except Exception, e:
        print e
.......................................................................

When I run this, I get:

character reference &#x540be3ff; exceeds ASCII range
character reference &#x540be3ff; exceeds ASCII range
character reference &#x540be3ff; exceeds sys.maxunicode (0xffff)

If I remove the handle_data, nothing happens.


--
Magnus Lie Hetland     Time flies like the wind. Fruit flies
http://hetland.org     like bananas. -- Groucho Marx
Jul 18 '05 #3
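
[Editor's sketch] One way to get the "replace it, as with encode/decode" behaviour asked about in the first post is to do the conversion yourself in a `handle_charref` hook, so the parser's default text conversion never sees the oversized number. The helper below is a minimal sketch under that assumption; `charref_to_text` and `TextCollector` are hypothetical names, not part of sgmlop. It is written for a current Python (under the Python 2 of this thread, `unichr` would take the place of `chr`).

```python
import re
import sys

# Decimal digits, or 'x'/'X' followed by hex digits: the text that may
# legitimately appear between '&#' and ';'.
_NUMERIC = re.compile(r"([0-9]+)|[xX]([0-9a-fA-F]+)")


def charref_to_text(body, replacement=u"\ufffd"):
    """Turn the text between '&#' and ';' into a character.

    Anything that is not a number (such as 'foo'), or that exceeds
    sys.maxunicode, yields `replacement` instead of raising.
    U+FFFD is the character that decode(..., "replace") substitutes.
    """
    match = _NUMERIC.fullmatch(body)
    if match is None:
        return replacement
    decimal, hexadecimal = match.groups()
    code = int(decimal, 10) if decimal is not None else int(hexadecimal, 16)
    if code > sys.maxunicode:
        return replacement
    return chr(code)


class TextCollector:
    """Hypothetical handler that gathers text and never raises on a charref."""

    def __init__(self):
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_charref(self, body):
        self.parts.append(charref_to_text(body))
```

Passing `replacement=""` instead would drop bad references rather than mark them. Whether the PyXML fork used in the post above calls `handle_charref` before raising is not established in this thread.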

Magnus Lie Hetland wrote:
>> if you have a handle_entityref hook, but no handle_charref, the part between
>> & and ; is passed to handle_entityref.
>
> Strange. It doesn't seem to work that way for me... Here is an example:
>
> from xml.parsers.sgmlop import SGMLParser, XMLParser, XMLUnicodeParser

are the PyXML folks shipping the latest sgmlop? I'm pretty sure they've
forked the code (there's no UnicodeParser in the effbot.org edition), and
I have no idea how things work in the fork.

</F>

Jul 18 '05 #4

In article <ma*************************************@python.org>,
Fredrik Lundh wrote:
[snip]
> are the PyXML folks shipping the latest sgmlop?

I don't know. The last history entry marked fl is from 2000-07-05...

Perhaps I should just get the effbot version. (And perhaps file a bug
report about this behaviour in PyXML.)

> I'm pretty sure they've forked the code (there's no UnicodeParser in
> the effbot.org edition),

Does it deal with Unicode at all? I.e., can I, for example, feed it a
Unicode object?

> and I have no idea how things work in the fork.

I see.

--
Magnus Lie Hetland     Time flies like the wind. Fruit flies
http://hetland.org     like bananas. -- Groucho Marx
Jul 18 '05 #5

Fredrik Lundh wrote:
> are the PyXML folks shipping the latest sgmlop? I'm pretty sure they've
> forked the code (there's no UnicodeParser in the effbot.org edition), and
> I have no idea how things work in the fork.


As we've forked the code, the answer is a clear "yes" :-) It certainly
is the latest release of the fork.

Regards,
Martin
Jul 18 '05 #6

Martin v. Löwis wrote:
>> are the PyXML folks shipping the latest sgmlop? I'm pretty sure they've
>> forked the code (there's no UnicodeParser in the effbot.org edition), and
>> I have no idea how things work in the fork.
>
> As we've forked the code, the answer is a clear "yes" :-) It certainly
> is the latest release of the fork.

if the 2000-07-05 date is correct, there have been at least eight public releases
of the original sgmlop distribution since the fork.

</F>

Jul 18 '05 #8

In article <ma*************************************@python.org>,
Fredrik Lundh wrote:
> Martin v. Löwis wrote:
>>> are the PyXML folks shipping the latest sgmlop? I'm pretty sure
>>> they've forked the code (there's no UnicodeParser in the
>>> effbot.org edition), and I have no idea how things work in the
>>> fork.
>>
>> As we've forked the code, the answer is a clear "yes" :-) It
>> certainly is the latest release of the fork.
>
> if the 2000-07-05 date is correct, there have been at least eight
> public releases of the original sgmlop distribution since the fork.


Hm. This may, of course, be just fine -- but it seems a bit
unfortunate to me... I.e. nice features added in each of the two, but
no distribution where all the features are available... Or something.
(Or at least all the bug fixes :)

Is there any chance of at least sharing fixes for things such as the
illegal charrefs becoming entity refs etc.? (Yeah, I know, I can
submit patches, but I don't know the code all that well...)

Or: What are the chances of handling Unicode with the Effbot sgmlop
(which seems to be the only feature I'm missing in that at the
moment)? Using UTF-8 or something would be completely acceptable to
me, as long as it works. (Maybe simply feeding it UTF-8 strings would
work as it is? Except for Unicode charrefs, of course... Or?)

- M

--
Magnus Lie Hetland     Time flies like the wind. Fruit flies
http://hetland.org     like bananas. -- Groucho Marx
Jul 18 '05 #9
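
[Editor's sketch] The UTF-8 idea floated in the last post could look roughly like this: encode the text before feeding a byte-oriented parser, turn numeric references into UTF-8 bytes yourself, and decode the collected output at the end. The thread never confirms that the effbot sgmlop passes non-ASCII bytes through untouched, so treat that as an assumption. All three function names are hypothetical, and the code targets a current Python.

```python
REPLACEMENT_BYTES = u"\ufffd".encode("utf-8")


def to_parser_bytes(text):
    """Encode text as UTF-8 before handing it to a byte-oriented parser."""
    return text.encode("utf-8")


def charref_to_utf8(code):
    """UTF-8 bytes for a code point, or the bytes of U+FFFD if unusable."""
    try:
        return chr(code).encode("utf-8")
    except (ValueError, OverflowError, UnicodeEncodeError):
        # Out of range, too large for chr(), or a lone surrogate.
        return REPLACEMENT_BYTES


def from_parser_bytes(chunks):
    """Join the byte chunks a handler collected and decode them once.

    Joining first avoids decoding a chunk that happens to end in the
    middle of a multi-byte sequence.
    """
    return b"".join(chunks).decode("utf-8", "replace")
```

A handler would append `handle_data` chunks and `charref_to_utf8(...)` results to one list and call `from_parser_bytes` on it when parsing is done.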
