
Strangeness with Japanese, XML, Java

I'm using JRE 1.5 on Solaris Japanese (Sparc). The JVM claims its default
character set is EUC-JP.
I'm seeing two strange things when using Japanese character sets...

1) If I write a program that does
System.out.println("$^%$%^^"); // assume those are Japanese characters that are multibyte under EUC-JP
The resulting output looks NOTHING like the characters I typed in.
Apparently the character set being used to read the literal is different
from the default.
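One way to check what the compiler actually decoded (a diagnostic sketch of my own, not from the original post; the literal here stands in for the garbled ones) is to print the JVM's default charset and the code points javac baked into the literal:

```java
import java.nio.charset.Charset;

public class EncodingCheck {
    public static void main(String[] args) {
        // The platform default charset the JVM picked up (EUC-JP on this Solaris box).
        System.out.println(Charset.defaultCharset());

        // Dump the code points of a literal; if javac decoded the source file
        // with the wrong charset, these will not be the characters you typed.
        String s = "\u3042"; // Hiragana A -- substitute one of your literals
        for (int i = 0; i < s.length(); i++) {
            System.out.printf("U+%04X%n", (int) s.charAt(i));
        }
    }
}
```

If the dumped code points are wrong, the problem is at compile time (javac's `-encoding`), not in the DOM at all.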

2) If I create an XML document using the built in DOM which contains
elements with values in Japanese, I get strangeness when I transform that
into an XML document. If I do not set the character set in the transformer
the document will say its in UTF-8 (the XML header will). However, the
actual document is NOT UTF-8. I downloaded IBM's ICU character set utilities
(it knows nothing of XML, just character sets) and when I try to read the
document when telling uconv it is UTF-8 it claims it is invalid UTF-8.
However, if I try to read it telling it the document is EUC-JP it says its
good.
Also, when I change the transformer to use EUC-JP it creates the same
document bit-for-bit (other than changing the XML header to say EUC-JP).
Other character sets (UTF-8, etc.) result in a different document.
So, my conclusion is that by default the XML DOM says it's UTF-8 in the
header, but ALWAYS uses the platform default unless you specify something
else (UTF-8 explicitly, for example).
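That behavior is consistent with how a StreamResult over a StringWriter works: the transformer hands *characters* to the writer, so the ENCODING property can only label the declaration; no bytes exist until you encode the String yourself, typically with the platform default. A sketch (my own, hedged; not code from the thread) contrasting the character sink with a byte sink, where the transformer itself does the encoding:

```java
import java.io.ByteArrayOutputStream;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class SerializeDemo {
    // Serialize via a byte sink so the transformer itself encodes;
    // here the header and the bytes cannot disagree.
    static byte[] toBytes(Document doc, String enc) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, enc);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        t.transform(new DOMSource(doc), new StreamResult(bos));
        return bos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        doc.appendChild(doc.createElement("root"))
           .appendChild(doc.createTextNode("\u3042\u3048\u3044\u304A\u3046"));

        // Character sink: ENCODING only labels the declaration.
        StringWriter sw = new StringWriter();
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        t.transform(new DOMSource(doc), new StreamResult(sw));
        // sw.toString() is still just Java chars; it only becomes UTF-8
        // (or EUC-JP...) when YOU encode it, e.g. via System.out.

        // Byte sink: genuinely UTF-8 bytes.
        byte[] utf8 = toBytes(doc, "UTF-8");
        System.out.println(new String(utf8, "UTF-8"));
    }
}
```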

Has anyone else seen this??
Here is my transformer...

Document new_document = documentBuilder.parse("japan2.xml");
System.out.println("I just read japan2.xml");
DOMSource new_source = new DOMSource(new_document);
StringWriter new_writer = new StringWriter();
StreamResult new_result = new StreamResult(new_writer);

Properties p = transformer.getOutputProperties();
//try explicit EUC-JP
//p.setProperty(OutputKeys.ENCODING, "EUC-JP");

//try the platform default (EUC-JP)
//p.setProperty(OutputKeys.ENCODING,
//    new OutputStreamWriter(new ByteArrayOutputStream()).getEncoding());

//try UTF-8 explicitly
//p.setProperty(OutputKeys.ENCODING, "UTF-8");

transformer.setOutputProperties(p);
Properties p2 = transformer.getOutputProperties();
p2.list(System.out);

transformer.transform(new_source, new_result);

String new_text_doc = new_writer.toString();
System.out.println("XML doc is "+new_text_doc );
Resulting document...
XML doc is <?xml version="1.0" encoding="UTF-8"?><GenAlertsReq
confirmed="true"
invokeId="2"><AlertList><Alert><Name>ja_alert-とちつなのに</Name><AffectedObjects
type="Obj"><Obj><Name>ja_mo-あえいおう</Name></Obj></AffectedObjects><Properties><Property><Name>Severity</Name><Value>major</Value></Property><Property><Name>Manager</Name><Value>NetExpert</Value></Property></Properties></Alert></AlertList><AttrList><Attr
name="TOD"><Int32>1112980583</Int32></Attr><Attr
name="DMPAlarmObject"><Str>ja_mo-あえいおう</Str></Attr><Attr
name="CLASS"><Str>NetExpert</Str></Attr><Attr
name="MANAGER"><Str>NetExpert</Str></Attr><Attr
name="DMPAlarmName"><Str>ja_alert-とちつなのに</Str></Attr><Attr
name="ARCHIVE_LENGTH"><Int32>0</Int32></Attr><Attr
name="DMPAlarmSeverity"><Str>major</Str></Attr><Attr
name="MsgType"><Str>Alarm</Str></Attr><Attr
name="MGR_PORT_KEY"><Int32>93</Int32></Attr><Attr
name="ARCHIVE_OFFSET"><Int32>0</Int32></Attr></AttrList></GenAlertsReq>

When I try to read it using IBM's ICU character set tool uconv I get the
following...
=> uconv -f UTF-8 ~/test/xml/japan.xml
Conversion to Unicode from codepage failed at input byte position 116.
Bytes: a4 Error: Illegal character found
<?xml version="1.0" encoding="UTF-8"?>
<GenAlertsReq confirmed="true"
invokeId="1"><AlertList><Alert><Name>ja_alert-

However, when I tell it the document is EUC-JP it works...
=> uconv -f EUC-JP ~/test/xml/japan.xml
<?xml version="1.0" encoding="UTF-8"?>
<GenAlertsReq confirmed="true" invokeId=......
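The same strict check uconv does can be reproduced from Java (a diagnostic sketch of my own, not part of the original post): a CharsetDecoder configured to REPORT throws on malformed input instead of silently substituting replacement characters:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class StrictDecode {
    // Returns true if the bytes are valid in the given charset, like `uconv -f`.
    static boolean decodes(byte[] bytes, String charset) {
        try {
            Charset.forName(charset).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] eucJp = "\u3042".getBytes("EUC-JP"); // 0xA4 0xA2
        System.out.println(decodes(eucJp, "EUC-JP")); // true
        System.out.println(decodes(eucJp, "UTF-8"));  // false: 0xA4 is not a valid UTF-8 lead byte
    }
}
```

The 0xA4 that this rejects under UTF-8 is exactly the byte uconv complains about above.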

So, the document appears to be EUC-JP even though the Java DOM says it's
UTF-8.
-Robert
Jul 20 '05 #1


Hi
Robert M. Gary wrote:
I'm using JRE 1.5 on Solaris Japanese (Sparc). The JVM claims its default
character set is EUC-JP
I'm seeing two strange things when using Japanese character sets...

1) If I write a program that does
System.out.println("$^%$%^^"); // assume those are Japanese characters that are multibyte under EUC-JP
The resulting output looks NOTHING like the characters I typed in.
Apparently the character set being used to read the literal is different
from the default.
1) Find out which encoding your Java source editor uses to save your Java
source files, then check your result.

2) javac -encoding <whatever you found above> ...java
2) If I create an XML document using the built in DOM which contains
elements with values in Japanese, I get strangeness when I transform that
into an XML document. If I do not set the character set in the transformer
the document will say its in UTF-8 (the XML header will). However, the
actual document is NOT UTF-8. I downloaded IBM's ICU character set utilities
(it knows nothing of XML, just character sets) and when I try to read the
document when telling uconv it is UTF-8 it claims it is invalid UTF-8.
However, if I try to read it telling it the document is EUC-JP it says its
good.
How do you serialize your DOMs? I guess you will have
UTF-8-decode(EUC-JP-encode(UTF-8-decode(EUC-JP-encode(literals))))
if you edit in EUC-JP, compile as UTF-8 and run your data through a
Writer that takes the platform default encoding ... that's a mess :)

Check that you override the platform default encoding and really go
UTF-8 when you serialize.
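Concretely (a sketch with made-up file names, assuming the String route is kept): the decisive conversion is the one done *after* the transform, when the String becomes bytes, and it has to match what the declaration claims:

```java
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class WriteUtf8 {
    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><r>\u3042</r>";

        // Wrong: new FileWriter(...) or System.out.println(xml) encode with the
        // platform default (EUC-JP here), so the bytes contradict the header.

        // Right: name the charset explicitly so the bytes match the header.
        Writer w = new OutputStreamWriter(
                new FileOutputStream("japan-out.xml"), "UTF-8");
        w.write(xml);
        w.close();
    }
}
```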
Also, when I change the transformer to use EUC-JP it creates the same
document bit-for-bit (other than changing the XML header to say EUC-JP).
Problem is where you serialize the document, not where you construct,
modify or transform it. And possibly in the decoding (by javac) of your
program text literals.
Other character sets (UTF-8, etc.) result in a different document.
Probably the document is read in correctly... anything other than Unicode
or EUC-JP will not be able to represent all the Japanese, and will break.
So, my conclusion is that by default the XML DOM says it's UTF-8 in the
header, but ALWAYS uses the platform default unless you specify something
else (UTF-8 explicitly, for example).
I'm pretty sure the error is where you output the data (you haven't
shown it..)
Has anyone else seen this??
All the time...
Document new_document = documentBuilder.parse("japan2.xml");
Verify until you are bloody sure what the encoding is of your input
document, and that it really matches with what the header says.
I think a mismatch will not result in an exception or anything, only bad
contents...

System.out.println("I just read japan2.xml");
DOMSource new_source = new DOMSource(new_document);
StringWriter new_writer = new StringWriter();
StreamResult new_result = new StreamResult(new_writer);
Properties p = transformer.getOutputProperties();
//try explicit EUC-JP
//p.setProperty(OutputKeys.ENCODING, "EUC-JP");

//try the platform default (EUC-JP)
//p.setProperty(OutputKeys.ENCODING,
//    new OutputStreamWriter(new ByteArrayOutputStream()).getEncoding());

//try UTF-8 explicitly
//p.setProperty(OutputKeys.ENCODING, "UTF-8");

transformer.setOutputProperties(p);
Properties p2 = transformer.getOutputProperties();
p2.list(System.out);

transformer.transform(new_source, new_result);

String new_text_doc = new_writer.toString();
System.out.println("XML doc is "+new_text_doc );
Please show us how it got into that file.

Resulting document...
XML doc is <?xml version="1.0" encoding="UTF-8"?><GenAlertsReq
confirmed="true"

....

Soren

Jul 20 '05 #2

Hi, Robert and myself,
Soren Kuula wrote:
Also, when I change the transformer to use EUC-JP it creates the same
document bit-for-bit (other than changing the XML header to say EUC-JP).

Problem is where you serialize the document, not where you construct,
modify or transform it. And possibly in the decoding (by javac) of your
program text literals.
Other character sets (UTF-8, etc.) result in a different document.


Probably the document is read in correctly... anything other than Unicode
or EUC-JP will not be able to represent all the Japanese, and will break.


Sorry, I misunderstood you there .. you mean, the OUTput is identical
except for the header?

I would take that as an indication that whatever you use for serializing
the DOM to a byte sequence (a file) does not look at what you set the
transformer to use. You will have to control that elsewhere.

Are you by any chance instantiating your own Writers when serializing?
Have you tried giving them different encoding settings?
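If so, the safest pattern (my own sketch, with invented names) is to let one constant drive both the Writer's charset and the transformer's ENCODING property, so the declaration and the actual bytes cannot drift apart:

```java
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class WriterEncoding {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        doc.appendChild(doc.createElement("root"))
           .appendChild(doc.createTextNode("\u3068\u3061\u3064\u306A\u306E\u306B"));

        // One constant for both the Writer (which encodes the bytes) and the
        // ENCODING property (which labels the declaration).
        String enc = "EUC-JP";
        Writer w = new OutputStreamWriter(
                new FileOutputStream("japan-euc.xml"), enc);
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, enc);
        t.transform(new DOMSource(doc), new StreamResult(w));
        w.close();
    }
}
```

With a Writer in play the transformer still never sees bytes; the point of setting ENCODING here is only to make the declaration tell the truth about what the Writer does.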

Soren

Jul 20 '05 #3

This discussion thread is closed

Replies have been disabled for this discussion.