
Errors parsing Japanese chars

I am trying to use the xerces-c SAX parser to parse Japanese characters. I
have a <?xml... utf-8> line in the XML file. When the parser
encounters the Japanese characters it throws a UTFDataFormatException.
I am quite new to XML and I am not sure how to deal with this
situation.
Is there a way to parse the Japanese characters? Or should the Japanese
characters be escaped in the XML file (e.g. &#1234;) for this to work?
Jul 20 '05 #1


On Tue, Jul 8, Sriv Chakravarthy inscribed on the eternal scroll:
> I am trying to use the xerces-c SAX parser to parse Japanese characters. I
> have a <?xml... utf-8> line in the XML file. When the parser
> encounters the Japanese characters it throws a UTFDataFormatException.
That seems to indicate that the Japanese characters are not in fact
encoded in UTF-8, then.
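
One way to test that suspicion without the parser is to check the raw bytes of the file for well-formed UTF-8. What follows is only a minimal sketch using the C++ standard library; the function name is_valid_utf8 is made up for illustration and is not part of xerces-c.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a xerces-c API): returns true if `bytes` is
// well-formed UTF-8. Overlong forms, surrogates and values above U+10FFFF
// are rejected.
bool is_valid_utf8(const std::string& bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t extra;     // continuation bytes expected after the lead byte
        unsigned long cp;      // code point being assembled
        unsigned long min_cp;  // smallest code point allowed for this length
        if (b0 < 0x80)                { extra = 0; cp = b0;        min_cp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; min_cp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (extra > n - i - 1) return false;  // sequence cut off at end of input
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp) return false;                    // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate range
        if (cp > 0x10FFFF) return false;                  // beyond Unicode
        i += extra + 1;
    }
    return true;
}
```

Text saved as Shift_JIS or EUC-JP will normally fail this check within the first Japanese character or two, because those encodings use byte patterns that are not legal UTF-8 sequences.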
> I am quite new to XML and I am not sure how to deal with this
> situation.
Irrespective of xml or not xml, any text file needs to be accompanied
with information on its encoding if it's to be reliably read. (Modulo
some heuristics which claim to auto-recognise a limited number of
encodings[1]).
> Is there a way to parse the Japanese characters?
If I've understood what you're reporting, it's not a matter of
_parsing_ them, it's a matter of understanding them in the first
place.
> Or should the Japanese
> characters be escaped in the XML file (e.g. &#1234;) for this to work?


Not necessarily. And indeed it's a most inefficient way to represent
them if a large quantity of CJK text is involved. But yes, it's
certainly a legal possibility.
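
To make the size cost concrete, here is a rough sketch that rewrites text already known to be UTF-8 so that every non-ASCII character becomes a hexadecimal character reference. The name to_char_refs is hypothetical, and the function assumes its input really is UTF-8 (it throws otherwise); ASCII, including any markup, passes through untouched.

```cpp
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical helper: replaces each non-ASCII character in UTF-8 input
// with a character reference such as &#x65E5;. ASCII bytes are copied as-is.
std::string to_char_refs(const std::string& utf8) {
    std::string out;
    const std::size_t n = utf8.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(utf8[i]);
        if (b0 < 0x80) {  // plain ASCII: copy unchanged
            out += static_cast<char>(b0);
            ++i;
            continue;
        }
        std::size_t extra;  // continuation bytes expected
        unsigned long cp;   // code point being assembled
        if ((b0 & 0xE0) == 0xC0)      { extra = 1; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; }
        else throw std::invalid_argument("input is not UTF-8: bad lead byte");

        if (extra > n - i - 1)
            throw std::invalid_argument("input is not UTF-8: truncated sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(utf8[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("input is not UTF-8: bad continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        char buf[16];
        std::snprintf(buf, sizeof buf, "&#x%lX;", cp);  // e.g. &#x65E5;
        out += buf;
        i += extra + 1;
    }
    return out;
}
```

A typical kanji occupies three bytes in UTF-8 but eight bytes as a reference of the form &#x65E5;, which is where the inefficiency comes from.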

Can you view your data (e.g. as plain text) in a web browser? (Or if
you haven't got a web browser, try MSIE...) Which character coding
does the browser need to be set to in order to make sense of the
Japanese? (You might try its auto recognition options and if it's
successful, then check to see which encoding it has chosen).

Then, if the encoding is one that's supported by the parser software,
just nominate it in the encoding declaration on the <?xml... line.

hope this helps.

[1] or of course the BOM, if you know for a fact that it's
a Unicode encoding that you're dealing with.
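
For what it's worth, checking for a BOM only takes a look at the first few bytes of the file. The sketch below assumes the caller has already read at least the first four bytes into a string; the name encoding_from_bom and the returned labels are illustrative choices, not anything defined by xerces-c.

```cpp
#include <string>

// Hypothetical helper: returns a label for the Unicode encoding signalled by
// a byte order mark at the start of `head`, or an empty string if there is
// no BOM (which says nothing either way about the actual encoding).
std::string encoding_from_bom(const std::string& head) {
    auto starts_with = [&head](const std::string& bom) {
        return head.compare(0, bom.size(), bom) == 0;
    };
    // The UTF-32 marks are tested first because the UTF-32LE mark begins
    // with the same two bytes as the UTF-16LE mark.
    if (starts_with(std::string("\x00\x00\xFE\xFF", 4))) return "UTF-32BE";
    if (starts_with(std::string("\xFF\xFE\x00\x00", 4))) return "UTF-32LE";
    if (starts_with(std::string("\xEF\xBB\xBF", 3)))     return "UTF-8";
    if (starts_with(std::string("\xFE\xFF", 2)))         return "UTF-16BE";
    if (starts_with(std::string("\xFF\xFE", 2)))         return "UTF-16LE";
    return "";
}
```

Legacy Japanese encodings such as Shift_JIS and EUC-JP carry no BOM, so an empty result here just means you are back to the browser experiment above.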
Jul 20 '05 #2
