
SAX parsing goes 'all funny' on value [en]

Hi,

I am parsing a small XML document and the parsing goes 'all funny'
when parsing this element: <useragent>Mozilla/4.61 [en] (WinNT;
I)</useragent>

I've created a subclass of org.xml.sax.helpers.DefaultHandler, and an
instance of this subclass is set on my
org.apache.xerces.parsers.SAXParser:

SAXParser parser = new SAXParser();
parser.setContentHandler(pdh);
parser.setErrorHandler(pdh);

I've found that the

public void characters(char[] ch, int offset, int length) throws
SAXException

method is called once per element parsed. My debug output confirms
this. e.g. when parsing <useragent>MobileExplorer/3.00 (Mozilla/1.22;
compatible; MMEF300; Microsoft; Windows; GenericLarge)</useragent> it
reads:

D: reading characters...(useragent) length=89, offset=721,
found='MobileExplorer/3.00 (Mozilla/1.22; compatible; MMEF300;
Microsoft; Windows; GenericLarge)'
D: ending element (useragent) current element value is :
[MobileExplorer/3.00 (Mozilla/1.22; compatible; MMEF300; Microsoft;
Windows; GenericLarge)]
But... when parsing <useragent>Mozilla/4.61 [en] (WinNT;
I)</useragent>
the debug output reads

D: reading characters...(useragent) length=16, offset=1097,
found='Mozilla/4.61 [en'
D: reading characters...(useragent) length=1, offset=0, found=']'
D: reading characters...(useragent) length=11, offset=1114, found='
(WinNT; I)'
D: ending (useragent) current element value is : [ (WinNT; I)]

It calls the characters method three times?!
Does the [en] bit in the element value have anything to do with this?
Would like to understand what and why.

(As a 'temp fix' I thought to have the DefaultHandler's characters(...)
method concatenate characters read, till the endElement(...) is
invoked; but that seems to break everything.)

Thanks for your input.
Fred.
Jul 20 '05 #1


Fred wrote:
> (As a 'temp fix' I thought to have the DefaultHandler's characters(...)
> method concatenate characters read, till the endElement(...) is
> invoked; but that seems to break everything.)


I think that's how SAX is supposed to work. There's no guarantee that
you're only getting a single event here.
Jul 20 '05 #2
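
To make the point concrete, here is a minimal sketch that replays the three chunks from the debug output in post #1. It assumes the original handler stored only the most recent chunk, which the final value "[ (WinNT; I)]" suggests but the thread does not confirm. The class and method names (ChunkReplay, overwrite, append) are hypothetical and not part of SAX.

```java
// Hypothetical helper: replays the chunks reported in the debug output.
class ChunkReplay {
    // The three pieces the parser delivered for one <useragent> element.
    static final String[] CHUNKS = {"Mozilla/4.61 [en", "]", " (WinNT; I)"};

    // Assumed behaviour of the original handler: each call replaces the value.
    static String overwrite(String[] chunks) {
        String value = "";
        for (String chunk : chunks) {
            value = chunk; // earlier chunks are lost
        }
        return value;
    }

    // Appending every chunk rebuilds the complete element text.
    static String append(String[] chunks) {
        StringBuilder value = new StringBuilder();
        for (String chunk : chunks) {
            value.append(chunk);
        }
        return value.toString();
    }
}
```

Under that assumption, overwriting leaves only the last chunk, while appending recovers the whole user agent string.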

Julian Reschke <ju************@gmx.de> wrote in
news:3F**************@gmx.de:
> Fred wrote:
>> (As a 'temp fix' I thought to have the DefaultHandlers characters(...)
>> method concatenate characters read, till the endElement(...) is
>> invoked; but that seems to break everything.)
>
> I think that's how SAX is supposed to work. There's no guarantee that
> you're only getting a single event here.


It *is* how SAX is supposed to work. Keep in mind that character data in
XML can be arbitrarily long; if a parser had to deliver character data in a
single chunk, it could find itself constantly allocating and reallocating
buffers. Not imposing such a requirement greatly simplifies buffer
management in a parser; it can use a fixed-size internal buffer and just
call the character handler when everything up to the end of the buffer is
character data, rather than having to shift everything around. That can
greatly speed up parsing.
Jul 20 '05 #3
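
For readers landing here, the concatenation approach discussed above might look like the following sketch. It is one possible shape, not the original poster's code: the class name AccumulatingHandler and the helper parse(String) are made up for illustration, and it uses the JDK's javax.xml.parsers.SAXParserFactory rather than instantiating org.apache.xerces.parsers.SAXParser directly. The key detail is that the buffer is cleared in startElement, so text from one element does not leak into the next.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the full text of every <useragent> element.
class AccumulatingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final List<String> userAgents = new ArrayList<>();
    private boolean inUserAgent = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("useragent".equals(qName)) {
            inUserAgent = true;
            text.setLength(0); // start each element with an empty buffer
        }
    }

    @Override
    public void characters(char[] ch, int offset, int length) {
        // May be called any number of times per element, so append, never assign.
        if (inUserAgent) {
            text.append(ch, offset, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("useragent".equals(qName)) {
            userAgents.add(text.toString()); // the value is complete only here
            inUserAgent = false;
        }
    }

    List<String> getUserAgents() {
        return userAgents;
    }

    // Convenience wrapper for the example; checked exceptions are wrapped.
    static List<String> parse(String xml) {
        try {
            AccumulatingHandler handler = new AccumulatingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.getUserAgents();
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }
}
```

A call such as AccumulatingHandler.parse("<log><useragent>Mozilla/4.61 [en] (WinNT; I)</useragent></log>") should return the complete string regardless of how many characters() events the parser chooses to emit.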
