Errors parsing Japanese chars

I am trying to use the xerces-c SAX parser to parse Japanese characters. I
have a <?xml... utf-8> line in the XML file. When the parser
encounters the Japanese characters it throws a UTFDataFormatException.
I am quite new to XML and I am not sure how to deal with this
situation.
Is there a way to parse the Japanese characters? Or should the Japanese
characters be escaped in the XML file (e.g. &#1234;) for this to work?
Jul 20 '05 #1
On Tue, Jul 8, Sriv Chakravarthy inscribed on the eternal scroll:
> I am trying to use the xerces-c SAX parser to parse Japanese characters. I
> have a <?xml... utf-8> line in the XML file. When the parser
> encounters the Japanese characters it throws a UTFDataFormatException.

That seems to indicate that the Japanese characters are not in fact
encoded in UTF-8, then.
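
One way to test that suspicion without involving the parser at all is to scan the raw bytes of the file and see whether they form well-formed UTF-8. The following is only a minimal sketch: the function name firstInvalidUtf8 is made up for illustration and is not part of xerces-c, and it assumes the file contents have already been read into a std::string.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a xerces-c API).
// Returns the byte offset of the first sequence that is not well-formed
// UTF-8, or std::string::npos if the whole buffer is well-formed.
std::size_t firstInvalidUtf8(const std::string& data) {
    const std::size_t n = data.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(data[i]);
        if (b0 < 0x80) { ++i; continue; }  // plain ASCII byte

        std::size_t len = 0;      // total length of the sequence
        unsigned long minCp = 0;  // smallest code point allowed for this length
        unsigned long cp = 0;     // decoded code point
        if ((b0 & 0xE0) == 0xC0)      { len = 2; minCp = 0x80;    cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; minCp = 0x800;   cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; minCp = 0x10000; cp = b0 & 0x07; }
        else return i;  // stray continuation byte or invalid lead byte

        if (i + len > n) return i;  // sequence is cut off at end of buffer
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(data[i + k]);
            if ((b & 0xC0) != 0x80) return i;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogates, and values beyond U+10FFFF.
        if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return i;
        i += len;
    }
    return std::string::npos;
}
```

If this reports an offset that lands on the first Japanese character, the file is very likely in some other encoding (Shift_JIS and EUC-JP are common candidates for Japanese text, though that is a guess about this particular file) and the utf-8 label on the <?xml... line is simply wrong.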

> I am quite new to XML and I am not sure how to deal with this
> situation.

Irrespective of xml or not xml, any text file needs to be accompanied
with information on its encoding if it's to be reliably read. (Modulo
some heuristics which claim to auto-recognise a limited number of
encodings[1]).

> Is there a way to parse the Japanese characters?

If I've understood what you're reporting, it's not a matter of
_parsing_ them, it's a matter of understanding them in the first
place.

> or should the Japanese
> characters be escaped in the XML file (e.g. &#1234;) for this to work?

Not necessarily. And indeed it's a most inefficient way to represent
them if a large quantity of CJK text is involved. But yes, it's
certainly a legal possibility.
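
To make the escaping option concrete, here is a rough sketch of converting non-ASCII characters to decimal numeric character references. The function name toNumericCharRefs is invented for this example, and the sketch assumes its input is already well-formed UTF-8; it does not escape markup characters such as & or <.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a xerces-c API).
// Replaces every non-ASCII character of a UTF-8 string with a decimal
// numeric character reference such as &#26085;.
// Assumption: the input has already been checked to be well-formed UTF-8.
std::string toNumericCharRefs(const std::string& utf8) {
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char b0 = static_cast<unsigned char>(utf8[i]);
        if (b0 < 0x80) {  // ASCII is copied through unchanged
            out += static_cast<char>(b0);
            ++i;
            continue;
        }
        // Sequence length follows from the lead byte.
        const std::size_t len = (b0 >= 0xF0) ? 4 : (b0 >= 0xE0) ? 3 : 2;
        // Keep only the payload bits of the lead byte (5, 4 or 3 bits).
        unsigned long cp = b0 & (0xFFu >> (len + 1));
        for (std::size_t k = 1; k < len && i + k < utf8.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        }
        out += "&#" + std::to_string(cp) + ";";
        i += len;
    }
    return out;
}
```

The size cost is easy to see here: a character in the U+0800 to U+FFFF range, which covers most Japanese text, takes 3 bytes in UTF-8 but 7 or 8 bytes as a decimal reference.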

Can you view your data (e.g. as plain text) in a web browser? (Or if
you haven't got a web browser, try MSIE...) Which character coding
does the browser need to be set to in order to make sense of the
Japanese? (You might try its auto recognition options and if it's
successful, then check to see which encoding it has chosen).

Then, if the encoding is one that's supported by the parser software,
just name it in the encoding attribute of the <?xml... declaration.
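
As a quick check on what the file currently claims, the declared encoding can be pulled out of the start of the file. This is only a sketch: declaredEncoding is a made-up helper name, and it assumes the file is in an ASCII-compatible encoding so that the declaration itself can be read byte by byte.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a xerces-c API).
// Extracts the encoding value from an XML declaration at the very start
// of `doc`, or returns "" if there is no declaration or no encoding in it.
// Assumption: ASCII-compatible encoding, declaration at byte offset 0.
std::string declaredEncoding(const std::string& doc) {
    if (doc.compare(0, 5, "<?xml") != 0) return "";
    const std::size_t end = doc.find("?>");
    if (end == std::string::npos) return "";
    const std::string decl = doc.substr(0, end);

    std::size_t pos = decl.find("encoding");
    if (pos == std::string::npos) return "";
    pos = decl.find_first_of("\"'", pos);  // opening quote of the value
    if (pos == std::string::npos) return "";
    const char quote = decl[pos];
    const std::size_t close = decl.find(quote, pos + 1);
    if (close == std::string::npos) return "";
    return decl.substr(pos + 1, close - pos - 1);
}
```

If the browser says the text is, say, Shift_JIS while this returns utf-8, then the declaration is what needs changing (or the file needs converting), always assuming your parser build actually supports the encoding you name.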

Hope this helps.

[1] or of course the BOM, if you know for a fact that it's
a Unicode encoding that you're dealing with.
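
For what it's worth, the BOM check can be sketched as a comparison against the known byte signatures at the start of the file. The function name encodingFromBom and the returned labels are chosen for this example only. Note that the 4-byte UTF-32 signatures have to be tested before the 2-byte UTF-16 ones, because FF FE is a prefix of FF FE 00 00.

```cpp
#include <cstddef>
#include <string>

// True if `data` begins with the `n` bytes in `sig`.
static bool startsWith(const std::string& data,
                       const unsigned char* sig, std::size_t n) {
    if (data.size() < n) return false;
    for (std::size_t i = 0; i < n; ++i) {
        if (static_cast<unsigned char>(data[i]) != sig[i]) return false;
    }
    return true;
}

// Hypothetical helper (not a xerces-c API).
// Returns the encoding implied by a leading byte order mark, or "" if
// the buffer does not start with one.
std::string encodingFromBom(const std::string& data) {
    static const unsigned char utf8[]    = {0xEF, 0xBB, 0xBF};
    static const unsigned char utf32le[] = {0xFF, 0xFE, 0x00, 0x00};
    static const unsigned char utf32be[] = {0x00, 0x00, 0xFE, 0xFF};
    static const unsigned char utf16le[] = {0xFF, 0xFE};
    static const unsigned char utf16be[] = {0xFE, 0xFF};

    if (startsWith(data, utf8, 3))    return "UTF-8";
    // Longer signatures first: FF FE is also the start of FF FE 00 00.
    if (startsWith(data, utf32le, 4)) return "UTF-32LE";
    if (startsWith(data, utf32be, 4)) return "UTF-32BE";
    if (startsWith(data, utf16le, 2)) return "UTF-16LE";
    if (startsWith(data, utf16be, 2)) return "UTF-16BE";
    return "";
}
```

An empty result says nothing either way: UTF-8 files often carry no BOM, and the legacy Japanese encodings never do.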
Jul 20 '05 #2
