Java sax UTF-8 parsing troubles -- PLEASE HELP...

Hi there,

I am in need of some help. I am trying to parse, using the Apache SAX
parser, a file that has valid UTF-8 characters, but I keep getting a
sun.io.MalformedInputException error.

This is my code:

infile = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><display_values><currency_display>\u00A5 Japanese Yen</currency_display></display_values>";

// the above is a perfectly valid Unicode symbol for Yen

XMLReader xr = new org.apache.xerces.parsers.SAXParser();

xr.setContentHandler(this);
xr.setErrorHandler(this);

ByteArrayInputStream bi = new ByteArrayInputStream(infile.getBytes());
Reader reader = new InputStreamReader(bi, "UTF-8");
InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");
xr.parse(is); // CRASHES RIGHT HERE...

This is the complete trace:

[8/29/04 22:38:40:756 GMT-05:00] 692c692c SystemErr R sun.io.MalformedInputException
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at java.lang.Throwable.<init>(Throwable.java)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at sun.nio.cs.StreamDecoder$ConverterSD.convertInto(StreamDecoder.java)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at sun.nio.cs.StreamDecoder$ConverterSD.implRead(StreamDecoder.java)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at sun.nio.cs.StreamDecoder.read(StreamDecoder.java)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at java.io.InputStreamReader.read(InputStreamReader.java)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at org.apache.xerces.impl.XMLEntityScanner.scanQName(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at com.polyorb.tipranavir.pdf.ConvertXML.cparse(ConvertXML.java)

What am I doing wrong here?

Thank you for any guidance...

Regards, Alex.
Jul 20 '05 #1
Aleksandar Matijaca (al****@polyorb.com) wrote:

: infile = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><display_values><currency_display>\u00A5 Japanese Yen</currency_display></display_values>";

A String in Java is not UTF-8, it's UTF-16, so if you pass the "raw
bytes" of the string to the parser, then what it receives isn't UTF-8.

However, I have never used the specific set of calls you are using, so
I don't know for sure that this is the problem.
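
To make the point about bytes concrete, here is a small sketch that prints how the Yen sign is encoded under three charsets. The class and method names (YenBytes, hex, encodedHex) are made up for illustration and are not from the original post. Note that String.getBytes() with no argument does not hand back the internal UTF-16 form; it encodes with the platform default charset, which is why the result can differ from machine to machine.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class YenBytes {
    /** Renders a byte array as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes the text with the given charset and returns the bytes as hex. */
    static String encodedHex(String text, Charset charset) {
        return hex(text.getBytes(charset));
    }

    public static void main(String[] args) {
        String yen = "\u00A5"; // the Yen sign from the original post
        System.out.println("UTF-8      : " + encodedHex(yen, StandardCharsets.UTF_8));      // C2 A5
        System.out.println("ISO-8859-1 : " + encodedHex(yen, StandardCharsets.ISO_8859_1)); // A5
        System.out.println("UTF-16BE   : " + encodedHex(yen, StandardCharsets.UTF_16BE));   // 00 A5
    }
}
```

Only the first of these byte sequences is what a UTF-8 decoder expects; a lone A5 byte is not a valid UTF-8 sequence.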

Jul 20 '05 #2


Aleksandar Matijaca wrote:

ByteArrayInputStream bi = new ByteArrayInputStream(infile.getBytes());


I suspect the problem is here: getBytes() uses the platform's default
encoding (character set), while you want UTF-8, so try
infile.getBytes("UTF-8")
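
A minimal sketch of that fix, under a few assumptions: it uses the JAXP parser bundled with the JDK (SAXParserFactory) instead of instantiating org.apache.xerces.parsers.SAXParser directly, so that it runs without the Xerces jar, and it uses StandardCharsets.UTF_8, which needs Java 7 or later. The class name Utf8SaxParse and the method currencyDisplay are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class Utf8SaxParse {
    /** Parses the XML text and returns the content of currency_display. */
    static String currencyDisplay(String xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inside;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                inside = "currency_display".equals(qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                inside = false;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times for one element, so append.
                if (inside) {
                    text.append(ch, start, length);
                }
            }
        };
        // Encode explicitly as UTF-8 so the bytes match the XML declaration.
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(bytes), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String infile = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<display_values><currency_display>\u00A5 Japanese Yen"
                + "</currency_display></display_values>";
        System.out.println(currencyDisplay(infile));
    }
}
```

Naming the charset in getBytes removes the dependency on the machine's default encoding, which is the only change relative to the failing code.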
--

Martin Honnen
http://JavaScript.FAQTs.com/
Jul 20 '05 #3
MARTIN - THIS FIXED IT!!! It was the infile.getBytes("UTF-8")
Martin and Malcolm, thank you very much for your suggestions.

All the best, Alex.
(Toronto)
Jul 20 '05 #4
Hi, I can see you got your problem solved, but are you sure it is
_really_ doing what you want it to do (and are you aware of what is
happening)?

Assuming the type of your parameter infile is String:

Character encoding is the translation between character strings and byte
strings. I also assume that whatever made the String infile somehow
managed to get the right chars out of the bytes in your file.

I think that this happens:

infile = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><display_values><currency_display>\u00A5 Japanese Yen</currency_display></display_values>";

// the above is perfectly valid UNICODE symbol for Yen

XMLReader xr = new org.apache.xerces.parsers.SAXParser();

xr.setContentHandler(this);
xr.setErrorHandler(this);
ByteArrayInputStream bi = new ByteArrayInputStream(infile.getBytes());

1a) (as before): getBytes() returns the PLATFORM_DEFAULT-ENCODED byte
string representation of your String.
or
1b) (after the fix): getBytes("UTF-8") returns the UTF-8-ENCODED byte
string representation of your String.

Reader reader = new InputStreamReader(bi,"UTF-8");

2) Your Reader then correctly DEcodes the byte stream into chars again.

InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");

3) The setEncoding statement should really have no effect; the
InputSource does not have the challenge of turning bytes into chars, as
it already has a Reader (a source of chars, as opposed to a Stream, a
source of bytes) to extract characters from. In other words, the
decoding work should have been done already.

xr.parse(is); // CRASHES RIGHT HERE...

- because UTF_DECODE(SOME_OTHER_ENCODING_ENCODE(s)) is not necessarily
= s for some String s.
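
That last point can be checked directly. The sketch below encodes a string with one charset and decodes the bytes strictly as UTF-8; the names RoundTrip and survivesUtf8Decode are invented for this illustration. It uses a CharsetDecoder obtained from newDecoder(), which reports malformed input by throwing rather than substituting a replacement character.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTrip {
    /**
     * Encodes s with the given charset, then decodes the bytes strictly as
     * UTF-8. Returns true only if the original string comes back unchanged.
     */
    static boolean survivesUtf8Decode(String s, Charset encodeWith) {
        byte[] bytes = s.getBytes(encodeWith);
        try {
            // A fresh decoder throws on malformed input by default.
            String decoded = StandardCharsets.UTF_8.newDecoder()
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
            return decoded.equals(s);
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String withYen = "\u00A5 Japanese Yen";
        // The Yen sign becomes the single byte A5, which is not valid UTF-8.
        System.out.println(survivesUtf8Decode(withYen, StandardCharsets.ISO_8859_1)); // false
        // Encoding and decoding with the same charset is lossless.
        System.out.println(survivesUtf8Decode(withYen, StandardCharsets.UTF_8));      // true
        // Pure ASCII text has the same bytes in both charsets, so the bug stays hidden.
        System.out.println(survivesUtf8Decode("Japanese Yen", StandardCharsets.ISO_8859_1)); // true
    }
}
```

The third case is why code like this can appear to work for a long time: the mismatch only shows up once a non-ASCII character arrives.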

I think you read a file into a String (correctly decoded, maybe by
coincidence).
Then you encode that String (a String is just a sequence of chars) into
bytes and decode that back into chars again. No need for that!

I suggest you let an InputStream read from your file, and use that
InputStream DIRECTLY as an argument to your InputSource. Reason: the
parser reading the InputSource should be clever enough (I think it is)
to UNDERSTAND the <?xml encoding="blah"... declaration IN the XML file
PROPER. Then it will automagically use the proper decoding.
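
A sketch of this suggestion, with some assumptions: an in-memory byte array stands in for the file so the example is self-contained, the JDK's bundled JAXP parser is used rather than the Xerces class from the original post, and the names DeclaredEncodingParse, sampleDocument and parseText are hypothetical. The InputSource is given only a byte stream, with no Reader and no setEncoding call.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class DeclaredEncodingParse {
    /** Builds the sample document, declared and encoded in the same charset. */
    static byte[] sampleDocument(Charset charset) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>"
                + "<display_values><currency_display>\u00A5 Japanese Yen"
                + "</currency_display></display_values>";
        return xml.getBytes(charset);
    }

    /** Parses a byte stream and returns all character data in the document. */
    static String parseText(InputStream in) throws Exception {
        final StringBuilder text = new StringBuilder();
        XMLReader xr = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        xr.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        // Bytes only: the parser works out the encoding from the declaration.
        xr.parse(new InputSource(in));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] latin1 = sampleDocument(StandardCharsets.ISO_8859_1);
        byte[] utf8 = sampleDocument(StandardCharsets.UTF_8);
        // Same parsing code, two encodings, same resulting characters.
        System.out.println(parseText(new ByteArrayInputStream(latin1))
                .equals(parseText(new ByteArrayInputStream(utf8)))); // true
    }
}
```

Because the parsing code never names a charset, changing the encoding of the XML files later only requires that the declaration in each file stays accurate.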

If that fails, you may try opening a Reader on an InputStream from the
file, and then supply the encoding yourself (taking the risk that one day
you will prefer to write your XML files in some other encoding, and your
program will not work anymore).
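
The fallback could look roughly like the following. This is a sketch only: the temporary file, the class name ReaderParse and its methods are made up for the example, and the JDK's bundled parser is assumed. The Reader does the decoding, so the charset is fixed in the program, and per the InputSource documentation a setEncoding call has no effect once a character stream is supplied.

```java
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ReaderParse {
    /** Parses an XML file through a Reader that decodes it as UTF-8. */
    static String parseTextAsUtf8(Path file) throws Exception {
        final StringBuilder text = new StringBuilder();
        XMLReader xr = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        xr.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        // The charset is chosen here, by the program, not by the parser.
        try (Reader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            xr.parse(new InputSource(reader));
        }
        return text.toString();
    }

    /** Writes the XML to a temporary file as UTF-8, parses it, and cleans up. */
    static String parseViaTempFile(String xml) throws Exception {
        Path file = Files.createTempFile("display_values", ".xml");
        try {
            Files.write(file, xml.getBytes(StandardCharsets.UTF_8));
            return parseTextAsUtf8(file);
        } finally {
            Files.deleteIfExists(file);
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<display_values><currency_display>\u00A5 Japanese Yen"
                + "</currency_display></display_values>";
        System.out.println(parseViaTempFile(xml).equals("\u00A5 Japanese Yen")); // true
    }
}
```

The trade-off is the one described above: if the files are ever written in another encoding, the hard-coded charset has to change with them.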

Anyway encoding a String into bytes and then back to a source of chars
(a Reader) only adds confusion.

Soren

Jul 20 '05 #5
Yes, actually, the string does have some UTF-8 characters, which I am indeed
expecting. I am expecting a combination of Yen currency characters, British
pounds, etc. This is an XML stream that needs to be parsed, modified, and
sent to FOP for PDF generation.

I have always done SAX parsing with plain Strings, and
that has always worked; however, I really did get stuck on this one.

Regards, Alex.

Jul 20 '05 #6
