I'm using JRE 1.5 on Solaris Japanese (Sparc). The JVM claims its default
character set is EUC-JP.
I'm seeing two strange things when using Japanese character sets...
1) If I write a program that does
System.out.println("$^%$%^^"); // assume those are Japanese characters that are multibyte under EUC-JP
the resulting output looks NOTHING like the characters I typed in.
Apparently the character set being used to read the literal is different
from the default.
2) If I create an XML document using the built-in DOM which contains
elements with values in Japanese, I get strangeness when I transform that
into an XML document. If I do not set the character set in the transformer,
the XML header will say the document is in UTF-8. However, the
actual document is NOT UTF-8. I downloaded IBM's ICU character set utilities
(uconv knows nothing of XML, just character sets). When I tell uconv the
document is UTF-8, it claims it is invalid UTF-8.
However, when I tell it the document is EUC-JP, it says it's
good.
Also, when I change the transformer to use EUC-JP it creates the same
document bit-for-bit (other than changing the XML header to say EUC-JP).
Other character sets (UTC, etc) result in a different document.
So, my conclusion is that by default the XML DOM says it's UTF-8 in the
header, but ALWAYS uses the platform default unless you specify something
else (UTC for example).
Has anyone else seen this??
Here is my transformer...
Document new_document = documentBuilder.parse("japan2.xml");
System.out.println("I just read japan2.xml");
DOMSource new_source = new DOMSource(new_document);
StringWriter new_writer = new StringWriter();
StreamResult new_result = new StreamResult(new_writer);
Properties p = transformer.getOutputProperties();
// try explicit EUC
//p.setProperty(OutputKeys.ENCODING, "EUC-JP");
// try the platform default (EUC)
//p.setProperty(OutputKeys.ENCODING,
//    new OutputStreamWriter(new ByteArrayOutputStream()).getEncoding());
// try UTF-8 explicitly
//p.setProperty(OutputKeys.ENCODING, "UTF-8");
transformer.setOutputProperties(p);
Properties p2 = transformer.getOutputProperties();
p2.list(System.out);
transformer.transform(new_source, new_result);
String new_text_doc = new_writer.toString();
System.out.println("XML doc is "+new_text_doc );
Resulting document...
XML doc is <?xml version="1.0" encoding="UTF-8"?><GenAlertsReq
confirmed="true"
invokeId="2"><AlertList><Alert><Name>ja_alert-¤È¤Á¤Ä¤Ê¤Î¤Ë</Name><AffectedObjects
type="Obj"><Obj><Name>ja_mo-¤¢¤¨¤¤¤ª¤¦</Name></Obj></AffectedObjects><Properties><Property><Name>Severity</Name><Value>major</Value></Property><Property><Name>Manager</Name><Value>NetExpert</Value></Property></Properties></Alert></AlertList><AttrList><Attr
name="TOD"><Int32>1112980583</Int32></Attr><Attr
name="DMPAlarmObject"><Str>ja_mo-¤¢¤¨¤¤¤ª¤¦</Str></Attr><Attr
name="CLASS"><Str>NetExpert</Str></Attr><Attr
name="MANAGER"><Str>NetExpert</Str></Attr><Attr
name="DMPAlarmName"><Str>ja_alert-¤È¤Á¤Ä¤Ê¤Î¤Ë</Str></Attr><Attr
name="ARCHIVE_LENGTH"><Int32>0</Int32></Attr><Attr
name="DMPAlarmSeverity"><Str>major</Str></Attr><Attr
name="MsgType"><Str>Alarm</Str></Attr><Attr
name="MGR_PORT_KEY"><Int32>93</Int32></Attr><Attr
name="ARCHIVE_OFFSET"><Int32>0</Int32></Attr></AttrList></GenAlertsReq>
When I try to read it using IBM's ICU character set tool uconv I get the
following...
=> uconv -f UTF-8 ~/test/xml/japan.xml
Conversion to Unicode from codepage failed at input byte position 116.
Bytes: a4 Error: Illegal character found
<?xml version="1.0" encoding="UTF-8"?>
<GenAlertsReq confirmed="true"
invokeId="1"><AlertList><Alert><Name>ja_alert-
However, when I tell it the document is EUC-JP it works...
=> uconv -f EUC-JP ~/test/xml/japan.xml
<?xml version="1.0" encoding="UTF-8"?>
<GenAlertsReq confirmed="true" invokeId=......
So, the document appears to be EUC-JP even though the Java DOM says it's
UTF-8.
-Robert
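The uconv check in the post above can be approximated from inside Java with a strict CharsetDecoder. What follows is a minimal sketch and not part of the original post: the names StrictDecodeCheck and decodesCleanly are made up for illustration, and it assumes a current JDK that ships the EUC-JP charset rather than the JRE 1.5 used in the thread.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecodeCheck {
    /** Returns true only if the whole byte array is legal in the given charset. */
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   // Fail instead of silently substituting a replacement character.
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A4 A2 is one hiragana character in EUC-JP. In UTF-8, A4 can only be
        // a continuation byte, so it is illegal at the start of a sequence.
        byte[] eucBytes = {(byte) 0xA4, (byte) 0xA2};
        System.out.println(decodesCleanly(eucBytes, StandardCharsets.UTF_8));    // false
        System.out.println(decodesCleanly(eucBytes, Charset.forName("EUC-JP"))); // true
    }
}
```

The byte a4 is the same one uconv rejects at position 116 in the output above.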
Hi
Robert M. Gary wrote:
> I'm using JRE 1.5 on Solaris Japanese (Sparc). The JVM claims its default
> character set is EUC-JP. I'm seeing two strange things when using Japanese
> character sets...
> 1) If I write a program that does
> System.out.println("$^%$%^^" ); //assume those are Japanese characters that are multibyte under EUC-JP
> The resulting output looks NOTHING like the characters I typed in.
> Apparently the character set being used to read the literal is different
> from the default.
1) Find out under which encoding your java source editor saves your java
source files. Check your result.
2) javac -encoding <whatever you found above> ...java
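To illustrate what an encoding mismatch does to a literal, here is a small sketch (not from the thread; the class and method names are hypothetical, and a current JDK with the EUC-JP charset is assumed). It writes the text with Unicode escapes, which sidesteps the source-file encoding entirely, and then simulates EUC-JP bytes being read as ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LiteralEncodingDemo {
    // Five hiragana characters written as Unicode escapes, so the compiled
    // string does not depend on the encoding javac assumes for the source.
    static final String HIRAGANA = "\u3042\u3048\u3044\u304A\u3046";

    /** Simulates EUC-JP bytes being decoded as if they were ISO-8859-1. */
    static String misreadEucAsLatin1(String text) {
        byte[] eucBytes = text.getBytes(Charset.forName("EUC-JP"));
        return new String(eucBytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String garbled = misreadEucAsLatin1(HIRAGANA);
        // Print code points so the result does not depend on the console encoding.
        for (char c : garbled.toCharArray()) {
            System.out.printf("%04X ", (int) c);
        }
        System.out.println();
    }
}
```

Each two-byte EUC-JP character turns into two Latin-1 characters, the first of which is always the currency sign (00A4). That is the same pattern as the ja_mo- value in the posted output.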
> 2) If I create an XML document using the built in DOM which contains
> elements with values in Japanese, I get strangeness when I transform that
> into an XML document. If I do not set the character set in the transformer
> the document will say its in UTF-8 (the XML header will). However, the
> actual document is NOT UTF-8. I downloaded IBM's ICU character set utilities
> (it knows nothing of XML, just character sets) and when I try to read the
> document when telling uconv it is UTF-8 it claims it is invalid UTF-8.
> However, if I try to read it telling it the document is EUC-JP it says its
> good.
How do you serialize your DOMs? I guess you will have
UTF-8-decode(EUC-JP-encode(UTF-8-decode(EUC-JP-encode(literals))))
if you edit in EUC-JP, compile as UTF-8 and run your data through a
Writer that takes the platform default encoding ... that's a mess :)
Check that you override the platform default encoding and really go
UTF-8 when you serialize.
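As a rough sketch of what overriding the platform default can look like: an OutputStreamWriter created without a charset argument uses the platform default, while one created with an explicit charset does not. The class name ExplicitWriterEncoding and its write helper are invented for this example, and a current JDK with the EUC-JP charset is assumed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitWriterEncoding {
    /** Encodes text through an OutputStreamWriter bound to an explicit charset. */
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Passing the charset here is what overrides the platform default.
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u3042"; // one hiragana character
        System.out.println(write(text, StandardCharsets.UTF_8).length);    // 3 (E3 81 82)
        System.out.println(write(text, Charset.forName("EUC-JP")).length); // 2 (A4 A2)
    }
}
```

The same String produces different bytes depending only on the Writer's charset, which is why the Writer, not the DOM, decides what ends up in the file.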
> Also, when I change the transformer to use EUC-JP it creates the same
> document bit-for-bit (other than changing the XML header to say EUC-JP).
Problem is where you serialize the document, not where you construct,
modify or transform it. And possibly in the decoding (by javac) of your
program text literals.
> Other character sets (UTC, etc) result in a different document.
Probably the document is read in correctly. Anything other than Unicode
and EUC will not be able to contain all the Japanese, and will bust.
> So, my conclusion is that by default the XML DOM says its UTF-8 in the
> header, but ALWAYS uses the platform default unless you specify something
> else (UTC for example).
I'm pretty sure the error is where you output the data (you haven't
shown it..)
> Has anyone else seen this??
All the time...
> Document new_document = documentBuilder.parse("japan2.xml");
Verify until you are bloody sure what the encoding is of your input
document, and that it really matches with what the header says.
I think a mismatch will not result in an exception or anything, only bad
contents...
> System.out.println("I just read japan2.xml");
> DOMSource new_source = new DOMSource(new_document);
> StringWriter new_writer = new StringWriter();
> StreamResult new_result = new StreamResult(new_writer);
> Properties p = transformer.getOutputProperties();
> //try explicit EUC
> //p.setProperty(OutputKeys.ENCODING, "EUC-JP");
> //try default (EUC)
> //p.setProperty(OutputKeys.ENCODING,
> //    new OutputStreamWriter(new ByteArrayOutputStream()).getEncoding());
> //try UTF explicitly
> //p.setProperty(OutputKeys.ENCODING, "UTF-8" );
> transformer.setOutputProperties(p);
> Properties p2 = transformer.getOutputProperties();
> p2.list(System.out);
> transformer.transform(new_source, new_result);
> String new_text_doc = new_writer.toString();
> System.out.println("XML doc is "+new_text_doc );
Please show us how it got into that file.
> Resulting document...
> XML doc is <?xml version="1.0" encoding="UTF-8"?><GenAlertsReq confirmed="true"
> ....
Soren
Hi, Robert and myself,
Soren Kuula wrote:
>> Also, when I change the transformer to use EUC-JP it creates the same
>> document bit-for-bit (other than changing the XML header to say EUC-JP).
> Problem is where you serialize the document, not where you construct,
> modify or transform it. And possibly in the decoding (by javac) of your
> program text literals.
>> Other character sets (UTC, etc) result in a different document.
> Probably the document is read in correctly .. anything else than unicode
> and EUC will not be able to contain all the Japanese, and will bust.
Sorry, I misunderstood you there .. you mean, the OUTput is identical
except for the header?
I would take that as an indication that whatever you use for serializing
the DOM to a byte sequence (file) does not look at what you set the
transformer to use. You will have to control that elsewhere.
Are you by any chance instantiating your own Writers when serializing?
Have you tried giving them different encoding settings?
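One way to keep the declared and actual encodings in step is to hand the transformer an OutputStream instead of a Writer, so that the transformer itself produces the bytes. The sketch below is only an illustration under assumptions: SerializeToBytes, serialize and sampleDocument are hypothetical names, the element content is sample data, and a current JDK with its bundled JAXP implementation is assumed.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class SerializeToBytes {
    /** Serializes the DOM to bytes, letting the transformer apply the encoding. */
    static byte[] serialize(Document doc, String encoding) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            // An OutputStream result means no Writer sits between the
            // transformer and the bytes.
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Builds a one-element document with Japanese text (sample data). */
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element name = doc.createElement("Name");
            name.setTextContent("ja_mo-\u3042\u3048\u3044\u304A\u3046");
            doc.appendChild(name);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = serialize(sampleDocument(), "UTF-8");
        // Decode with the same charset that was requested before inspecting.
        System.out.println(new String(utf8, StandardCharsets.UTF_8));
    }
}
```

With a StringWriter, as in the posted code, the transformer only ever sees characters, so the ENCODING property can change the declaration but not the bytes. Those are likely decided later, when the String is printed or written with the platform default (EUC-JP here).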
Soren