Java sax UTF-8 parsing troubles -- PLEASE HELP...

Hi there,

I need some help. I am trying to parse, using the Apache SAX parser,
a file that has valid UTF-8 characters, but I keep getting a

sun.io.MalformedInputException error.

This is my code:

infile = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
       + "<display_values><currency_display>\u00A5 Japanese Yen</currency_display></display_values>";

// the above is perfectly valid UNICODE symbol for Yen

XMLReader xr = new org.apache.xerces.parsers.SAXParser();

xr.setContentHandler(this);
xr.setErrorHandler(this);

ByteArrayInputStream bi = new ByteArrayInputStream(infile.getBytes());
Reader reader = new InputStreamReader(bi, "UTF-8");
InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");
xr.parse(is); // CRASHES RIGHT HERE...

this is the complete trace...

[8/29/04 22:38:40:756 GMT-05:00] 692c692c SystemErr R
sun.io.MalformedInputException
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
java.lang.Throwable.<init>(Throwable.java)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
sun.nio.cs.StreamDecoder$ConverterSD.convertInto(StreamDecoder.java)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
sun.nio.cs.StreamDecoder$ConverterSD.implRead(StreamDecoder.java)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
sun.nio.cs.StreamDecoder.read(StreamDecoder.java)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
java.io.InputStreamReader.read(InputStreamReader.java)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
org.apache.xerces.impl.XMLEntityScanner.scanQName(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
com.polyorb.tipranavir.pdf.ConvertXML.cparse(ConvertXML.java)
What am I doing wrong here???

Thank you for any guidance...

Regards, Alex.
Jul 20 '05 #1
Aleksandar Matijaca (al****@polyorb.com) wrote:
: Hi there,

: I need some help. I am trying to parse, using the Apache SAX parser,
: a file that has valid UTF-8 characters, but I keep getting a

: sun.io.MalformedInputException error.

: This is my code:

: infile = "<?xml version=\"1.0\"
: encoding=\"UTF-8\"?><display_values><currency_display>\u00A5 Japanese
: Yen</currency_display></display_values>";

A String in Java is not UTF-8; in memory it is UTF-16. So if you pass the
"raw bytes" of the string to the parser, those bytes are not necessarily
UTF-8.

However, I have never used the specific sequence of calls you are
using, so I don't know for sure that this is the problem.
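
To make the point above concrete, here is a small sketch (not from the original posts) that prints the bytes the Yen sign turns into under a few encodings. The class name YenBytes and the helper hex are made up for illustration. What the no-argument getBytes() prints depends on the machine it runs on, which is exactly the trap.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class YenBytes {
    // Render a byte array as space-separated hex, e.g. "C2 A5".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String yen = "\u00A5"; // a single UTF-16 code unit in memory

        System.out.println("UTF-8:      " + hex(yen.getBytes(StandardCharsets.UTF_8)));
        System.out.println("UTF-16BE:   " + hex(yen.getBytes(StandardCharsets.UTF_16BE)));
        System.out.println("ISO-8859-1: " + hex(yen.getBytes(StandardCharsets.ISO_8859_1)));

        // The no-argument form uses the platform default, whatever that is here.
        System.out.println("default (" + Charset.defaultCharset() + "): "
                + hex(yen.getBytes()));
    }
}
```

Only the first line is what a UTF-8 decoder expects to see; a platform default such as ISO-8859-1 produces a lone A5 byte instead.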

Jul 20 '05 #2


Aleksandar Matijaca wrote:

I need some help. I am trying to parse, using the Apache SAX parser,
a file that has valid UTF-8 characters, but I keep getting a

sun.io.MalformedInputException error.

This is my code:

infile = "<?xml version=\"1.0\"
encoding=\"UTF-8\"?><display_values><currency_display>\u00A5 Japanese
Yen</currency_display></display_values>";

// the above is perfectly valid UNICODE symbol for Yen

XMLReader xr = new org.apache.xerces.parsers.SAXParser();

xr.setContentHandler(this);
xr.setErrorHandler(this);

ByteArrayInputStream bi = new
ByteArrayInputStream(infile.getBytes());


I suspect the problem is here: getBytes() uses the platform's default
encoding (character set), while you want UTF-8, so try
infile.getBytes("UTF8")
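
A minimal sketch of the suggested fix, for readers who want something to run. Assumptions, none of which come from the original posts: the JDK's built-in SAX parser (obtained through SAXParserFactory) stands in for org.apache.xerces.parsers.SAXParser, the class name Utf8BytesFix and the method parseText are invented, and StandardCharsets.UTF_8 replaces the string name "UTF8".

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class Utf8BytesFix {
    // Parses the XML text and returns all character data the handler receives.
    static String parseText(String xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        // Assumption: the JDK's default SAX parser stands in for Xerces' SAXParser.
        XMLReader xr = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        xr.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });

        // Encode with an explicit charset so it matches the decoder below.
        byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);
        Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        xr.parse(new InputSource(reader));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String infile = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<display_values><currency_display>\u00A5 Japanese Yen"
                + "</currency_display></display_values>";
        System.out.println(parseText(infile));
    }
}
```

The only change that matters is that the encoder and the decoder now name the same charset.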
--

Martin Honnen
http://JavaScript.FAQTs.com/
Jul 20 '05 #3
MARTIN - THIS FIXED IT!!! It was the infile.getBytes("UTF-8")
Martin and Malcolm, thank you very much for your suggestions.

All the best, Alex.
(Toronto)
Jul 20 '05 #4
Aleksandar Matijaca wrote:
Hi there,

Hi, I can see you got your problem solved, but are you sure it is
_really_ doing what you want it to do (and are you aware of what is
happening)?

Assuming the type of your parameter infile is String:

Character encoding is the translation between character strings and byte
strings. I assume also that whatever made the String infile has
somehow managed to get the right chars out of the bytes in your file.

I think that this happens:

infile = "<?xml version=\"1.0\"
encoding=\"UTF-8\"?><display_values><currency_display>\u00A5 Japanese
Yen</currency_display></display_values>";

// the above is perfectly valid UNICODE symbol for Yen

XMLReader xr = new org.apache.xerces.parsers.SAXParser();

xr.setContentHandler(this);
xr.setErrorHandler(this);
ByteArrayInputStream bi = new
ByteArrayInputStream(infile.getBytes());

1a) (as before): getBytes() returns the PLATFORM-DEFAULT-ENCODED byte string
representation of your String.
or
1b) (after the fix): getBytes("UTF-8") returns the UTF-8-ENCODED byte string
representation of your String.

2) Your Reader then correctly DEcodes the byte stream into chars again:

Reader reader = new InputStreamReader(bi,"UTF-8");
InputSource is = new InputSource(reader);

3) The setEncoding statement should really have no effect; the
InputSource does not have the challenge of turning bytes into chars, as
it already has a Reader (a source of chars, as opposed to a Stream, a
source of bytes) to extract characters from. In other words, the
decoding work should have been done already.

is.setEncoding("UTF-8");
xr.parse(is); // CRASHES RIGHT HERE...

- because UTF_DECODE(SOME_OTHER_ENCODING_ENCODE(s)) is not necessarily
= s for some String s.
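
The last point can be checked in isolation. The sketch below is an illustration, not code from the thread: ISO-8859-1 is assumed as a stand-in for "some other encoding", the names RoundTripMismatch and decodeUtf8Strict are invented, and a strict java.nio CharsetDecoder is used so that bad input raises an exception instead of being silently replaced. On a modern JDK the exception is java.nio.charset.MalformedInputException, which plays the role of the sun.io.MalformedInputException in the trace.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class RoundTripMismatch {
    // Strictly decodes bytes as UTF-8; throws instead of substituting U+FFFD.
    static String decodeUtf8Strict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) throws Exception {
        String s = "\u00A5 Japanese Yen";

        // Same charset both ways: the round trip is lossless.
        String ok = decodeUtf8Strict(s.getBytes(StandardCharsets.UTF_8));
        System.out.println("UTF-8 round trip equal: " + ok.equals(s));

        // Encode as ISO-8859-1 (assumed stand-in for a platform default),
        // then decode as UTF-8: the lone byte 0xA5 is not valid UTF-8.
        try {
            decodeUtf8Strict(s.getBytes(StandardCharsets.ISO_8859_1));
            System.out.println("decoded without error");
        } catch (CharacterCodingException e) {
            System.out.println("decode failed: " + e);
        }
    }
}
```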

I think you read a file into a String (correctly decoded, maybe by
coincidence).
Then you encode that String (a String is just a sequence of chars) into
bytes and decode that back into chars again. No need for that !!

I suggest you let an InputStream read from your file, and use that
InputStream DIRECTLY as the argument to your InputSource. Reason: the
InputSource may be clever enough (I think it is) to UNDERSTAND the <?xml
encoding="blah" ... declaration IN the XML file PROPER. Then it will
automagically use the proper decoding.
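
A rough sketch of that suggestion, under stated assumptions: a byte array encoded as ISO-8859-1 stands in for the file on disk, the JDK's default SAX parser stands in for Xerces, and the names ByteStreamSource and parseText are made up. No Reader is created and setEncoding is never called; the parser works out the decoding from the XML declaration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ByteStreamSource {
    // Parses raw bytes; the parser picks the decoding from the XML declaration.
    static String parseText(InputStream in) throws Exception {
        final StringBuilder text = new StringBuilder();
        XMLReader xr = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        xr.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        // The InputSource carries bytes only: no Reader, no setEncoding().
        xr.parse(new InputSource(in));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a file on disk: a document stored as ISO-8859-1
        // whose declaration says so.
        String doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<currency_display>\u00A5 Japanese Yen</currency_display>";
        byte[] onDisk = doc.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(parseText(new ByteArrayInputStream(onDisk)));
    }
}
```

With a real file, a FileInputStream would take the place of the ByteArrayInputStream, and the same code keeps working if the file is later saved in another declared encoding.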

If that fails, you may try opening a Reader on an InputStream from the file
and supplying the encoding yourself (taking the risk that one day you
will prefer to write your XML files in some other encoding, and your
program will not work anymore).

Anyway encoding a String into bytes and then back to a source of chars
(a Reader) only adds confusion.
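
If the XML really does start life as a String, one way to skip the byte round trip altogether is to hand the parser a StringReader. This is a sketch under assumptions, not something posted in the thread: the class name StringSource and the method parseText are invented, and the JDK's default SAX parser again stands in for Xerces.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class StringSource {
    // Parses XML held in a String without ever turning it into bytes.
    static String parseText(String xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        XMLReader xr = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        xr.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        // A character stream: no encoding or decoding step is involved, and the
        // encoding named in the XML declaration is not used to decode anything.
        xr.parse(new InputSource(new StringReader(xml)));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String infile = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<display_values><currency_display>\u00A5 Japanese Yen"
                + "</currency_display></display_values>";
        System.out.println(parseText(infile));
    }
}
```

Because no bytes are involved, the platform default encoding cannot interfere here.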

Soren

Jul 20 '05 #5
Yes, actually, the string does have some UTF-8 characters, which I am indeed
expecting. I am expecting a combination of Yen currency characters, British
pounds, etc. This is an XML stream that needs to be parsed, modified, and
sent to FOP for PDF generation.

I have always dealt with SAX parsing with plain Strings, and
that has always worked; however, I really did get stuck on this one...
Regards, Alex.

Jul 20 '05 #6
