
Java sax UTF-8 parsing troubles -- PLEASE HELP...

Hi there,

I need some help. I am trying to parse a file that has valid UTF-8
characters using the Apache SAX parser, and I keep getting a

sun.io.MalformedInputException error.

This is my code:

infile = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<display_values><currency_display>\u00A5 Japanese Yen</currency_display></display_values>";

// the above is a perfectly valid Unicode escape for the Yen symbol

XMLReader xr = new org.apache.xerces.parsers.SAXParser();

xr.setContentHandler(this);
xr.setErrorHandler(this);

ByteArrayInputStream bi = new ByteArrayInputStream(infile.getBytes());
Reader reader = new InputStreamReader(bi,"UTF-8");
InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");
xr.parse(is); // CRASHES RIGHT HERE...

This is the complete trace:

[8/29/04 22:38:40:756 GMT-05:00] 692c692c SystemErr R sun.io.MalformedInputException
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at java.lang.Throwable.<init>(Throwable.java)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at sun.nio.cs.StreamDecoder$ConverterSD.convertInto(StreamDecoder.java)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at sun.nio.cs.StreamDecoder$ConverterSD.implRead(StreamDecoder.java)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at sun.nio.cs.StreamDecoder.read(StreamDecoder.java)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at java.io.InputStreamReader.read(InputStreamReader.java)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at org.apache.xerces.impl.XMLEntityScanner.scanQName(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at com.polyorb.tipranavir.pdf.ConvertXML.cparse(ConvertXML.java)

What am I doing wrong here???

Thank you for any guidance...

Regards, Alex.
Jul 20 '05 #1
5 Replies


Aleksandar Matijaca (al****@polyorb.com) wrote:
: Hi there,

: I need some help. I am trying to parse a file that has valid UTF-8
: characters using the Apache SAX parser, and I keep getting a

: sun.io.MalformedInputException error.

: This is my code:

: infile = "<?xml version=\"1.0\"
: encoding=\"UTF-8\"?><display_values><currency_display>\u00A5 Japanese
: Yen</currency_display></display_values>";

A String in Java is not UTF-8; internally it is UTF-16. So if you pass
the "raw bytes" of the string to the parser without choosing an encoding,
they are not necessarily UTF-8.

However, I have never used the specific set of calls you are using, so
I don't know for sure that this is the problem.
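
To make the point about in-memory strings versus encoded bytes concrete, here is a minimal sketch. The class name YenBytes and the hex() helper are made up for illustration; only String.getBytes(Charset) and StandardCharsets come from the JDK. It shows that the same one-character string yields different bytes depending on which encoding is requested.

```java
import java.nio.charset.StandardCharsets;

class YenBytes {
    /** Formats a byte array as space-separated upper-case hex pairs (hypothetical helper). */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String yen = "\u00A5"; // the Yen sign: one char in memory

        System.out.println(yen.length());                                   // 1
        System.out.println(hex(yen.getBytes(StandardCharsets.UTF_16BE)));   // 00 A5
        System.out.println(hex(yen.getBytes(StandardCharsets.ISO_8859_1))); // A5
        System.out.println(hex(yen.getBytes(StandardCharsets.UTF_8)));      // C2 A5
    }
}
```

Only the last byte sequence is what a UTF-8 decoder expects for the Yen sign, so the bytes have to be produced with UTF-8 explicitly.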

Jul 20 '05 #2

Aleksandar Matijaca wrote:

: I am in some need of help. I am trying to parse, using the Apache SAX
: parser, a file that has valid UTF-8 characters, and I keep getting a
:
: sun.io.MalformedInputException error.
:
: This is my code:
:
: infile = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
:         + "<display_values><currency_display>\u00A5 Japanese Yen</currency_display></display_values>";
:
: // the above is perfectly valid UNICODE symbol for Yen
:
: XMLReader xr = new org.apache.xerces.parsers.SAXParser();
:
: xr.setContentHandler(this);
: xr.setErrorHandler(this);
:
: ByteArrayInputStream bi = new ByteArrayInputStream(infile.getBytes());

I suspect the problem is here: getBytes() with no argument uses the
platform's default encoding (character set), while you want UTF-8, so try
infile.getBytes("UTF8")
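
As a rough illustration of why the no-argument getBytes() can break a UTF-8 reader, the sketch below checks byte arrays with a strict UTF-8 decoder. It assumes, purely for the sake of the example, that the platform default was a single-byte charset such as ISO-8859-1; the thread does not say what the default actually was, and on JDK 18 and later the default is UTF-8 anyway. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Check {
    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // A lone 0xA5 byte is a continuation byte with no lead byte: malformed.
            return false;
        }
    }

    public static void main(String[] args) {
        String infile = "<currency_display>\u00A5 Japanese Yen</currency_display>";

        // Assumption: stands in for getBytes() on a platform whose default is ISO-8859-1.
        byte[] latin1 = infile.getBytes(StandardCharsets.ISO_8859_1);
        // The suggested fix: ask for UTF-8 explicitly.
        byte[] utf8 = infile.getBytes(StandardCharsets.UTF_8);

        System.out.println(isValidUtf8(latin1)); // false
        System.out.println(isValidUtf8(utf8));   // true
    }
}
```

Passing a Charset constant instead of the string name also avoids the checked UnsupportedEncodingException that getBytes(String) declares.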
--

Martin Honnen
http://JavaScript.FAQTs.com/
Jul 20 '05 #3

MARTIN - THIS FIXED IT!!! It was the infile.getBytes("UTF-8")
Martin and Malcolm, thank you very much for your suggestions.

All the best, Alex.
(Toronto)
Jul 20 '05 #4

Aleksandar Matijaca wrote:
: Hi there,

Hi, I can see you got your problem solved, but are you sure it is
_really_ doing what you want it to do (and are you aware of what is
happening)?

Assuming the type of your parameter infile is String:

Character encoding is the translation between character strings and byte
strings. I assume also that whatever made the String infile, it has
somehow managed to get the right chars out of the bytes in your file.

I think this is what happens:

: infile = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
:         + "<display_values><currency_display>\u00A5 Japanese Yen</currency_display></display_values>";
:
: // the above is perfectly valid UNICODE symbol for Yen
:
: XMLReader xr = new org.apache.xerces.parsers.SAXParser();
:
: xr.setContentHandler(this);
: xr.setErrorHandler(this);
: ByteArrayInputStream bi = new ByteArrayInputStream(infile.getBytes());

1a) (as before) getBytes() returns a PLATFORM-DEFAULT-ENCODED byte string
representation of your String,
or
1b) (after the fix) getBytes("UTF-8") returns a UTF-8-ENCODED byte string
representation of your String.

: Reader reader = new InputStreamReader(bi,"UTF-8");
: InputSource is = new InputSource(reader);

2) Your Reader then DEcodes the byte stream into chars again, as UTF-8.

: is.setEncoding("UTF-8");

3) The setEncoding statement should really have no effect; the
InputSource does not have the challenge of turning bytes into chars, as
it already has a Reader (a source of chars, as opposed to a Stream, a
source of bytes) to extract characters from. In other words, the
decoding work should have been done already.

: xr.parse(is); // CRASHES RIGHT HERE...

- because UTF_DECODE(SOME_OTHER_ENCODING_ENCODE(s)) is not necessarily
= s for some String s.
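
A small sketch of that last point, with ISO-8859-1 standing in for SOME_OTHER_ENCODING (an assumption for illustration; the actual platform default in the original setup is not stated). The class and method names are hypothetical. Note that new String(bytes, charset) silently substitutes a replacement character for malformed input, whereas the old converter in the stack trace threw an exception instead.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTrip {
    /** Encodes s with the given charset, then decodes the resulting bytes as UTF-8. */
    static String encodeThenDecodeUtf8(String s, Charset encodeWith) {
        byte[] bytes = s.getBytes(encodeWith);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "\u00A5 Japanese Yen";

        // Same encoding on both sides: the round trip is lossless.
        System.out.println(s.equals(encodeThenDecodeUtf8(s, StandardCharsets.UTF_8)));      // true

        // Mismatched encodings: the Yen sign does not survive.
        System.out.println(s.equals(encodeThenDecodeUtf8(s, StandardCharsets.ISO_8859_1))); // false
    }
}
```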

I think you read a file into a String (correctly decoded, maybe by
coincidence).
Then you encode that String (a String is just a sequence of chars) into
bytes and decode that back into chars again. No need for that !!

I suggest you let an InputStream read from your file, and pass that
InputStream DIRECTLY to your InputSource. Reason: given raw bytes, the
parser may be clever enough (I think it is) to UNDERSTAND the <?xml ...
encoding="blah"?> declaration IN the XML file PROPER. Then it will
automatically use the proper decoding.
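
A sketch of that suggestion, under a few assumptions: a ByteArrayInputStream stands in for the FileInputStream you would open on the real file, the JDK's built-in JAXP parser is used rather than instantiating the Xerces class directly, and the document is deliberately declared and encoded as ISO-8859-1 to show that the declaration, not the program, selects the decoding. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class DeclaredEncodingDemo {
    /** Parses raw XML bytes and returns all character data found in the document. */
    static String textOf(byte[] xmlBytes) {
        StringBuilder text = new StringBuilder();
        try {
            XMLReader xr = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            xr.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
            // A byte stream, with no Reader and no setEncoding():
            // the parser works out the decoding from the document itself.
            xr.parse(new InputSource(new ByteArrayInputStream(xmlBytes)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<currency_display>\u00A5 Japanese Yen</currency_display>";
        // The bytes really are ISO-8859-1, matching the declaration.
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("\u00A5 Japanese Yen".equals(textOf(bytes)));
    }
}
```

The program never names an encoding for the parser, so a file later rewritten in another declared encoding should keep working as long as the declaration and the bytes agree.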

If that fails, you may try opening a Reader on an InputStream from the
file and supplying the encoding yourself (taking the risk that one day
you will prefer to write your XML files in some other encoding, and your
program will not work anymore).

Anyway encoding a String into bytes and then back to a source of chars
(a Reader) only adds confusion.
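
If the XML really does start life as a String, as in the original post, one way to skip the bytes entirely is to wrap it in a StringReader. This is a sketch only, using the JDK's built-in JAXP parser rather than the Xerces class from the post; the class and method names are hypothetical. With a character stream the parser reads the chars directly and disregards the encoding declaration, so no getBytes() or setEncoding() call is involved.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class StringReaderDemo {
    /** Parses XML held in a String and returns all character data found in it. */
    static String textOf(String xml) {
        StringBuilder text = new StringBuilder();
        try {
            XMLReader xr = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            xr.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
            // Chars in, chars out: there is no byte encoding step that could go wrong.
            xr.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String infile = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<display_values><currency_display>\u00A5 Japanese Yen"
                + "</currency_display></display_values>";

        System.out.println("\u00A5 Japanese Yen".equals(textOf(infile)));
    }
}
```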

Soren

Jul 20 '05 #5

Yes, actually, the string does have some UTF-8 characters, which I am indeed
expecting. I am expecting a combination of Yen currency characters, British
pounds, etc. This is an XML stream that needs to be parsed, modified, and
sent to FOP for PDF generation.

I have always done my SAX parsing with plain Strings, and
that has always worked; however, I really did get stuck on this one...
Regards, Alex.


Jul 20 '05 #6
