UTF-8 incorrect from org.apache.xml.serialize.XMLSerializer

I must be missing something.

I am using org.apache.xml.serialize.XMLSerializer to save a DOM but I am not
getting non-basic characters converted to UTF-8.

I create Text nodes in the DOM by, for example:

Document doc;
JTextArea textPrompt;
Text newTextNode;
Element descElt;
....
newTextNode = doc.createTextNode(textPrompt.getText());
descElt.appendChild(newTextNode);

The code to serialize the DOM is:

private void saveXml(Document document)
{
    // rename the existing layout file
    new File(fileName).renameTo(new File(fileName + "~"));
    // write the document out
    OutputFormat format = new OutputFormat(document);
    format.setIndenting(true);
    format.setLineWidth(0);
    format.setPreserveSpace(true);
    try {
        XMLSerializer serializer;
        serializer = new XMLSerializer(
            new FileWriter(fileName),
            format);
        serializer.asDOMSerializer();
        serializer.serialize(document);
    }
    catch (IOException ioe)
    {
        ....
    }
}

If I enter a character such as é (e with acute accent) into the JTextArea
and look at the XML file using a non-UTF-8-aware editor, I see that the é
has been written as a single byte, not as the 2-byte UTF-8 sequence. If I
subsequently try to read the XML file using Xerces, it blows up because of
the invalid byte sequence.

How do I get a valid serialization of this DOM into XML using UTF-8?
--
Jim Cobban jc*****@magma.ca
34 Palomino Dr.
Kanata, ON, CANADA
K2M 1M1
+1-613-592-9438

Jul 20 '05 #1
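
For readers following along, the symptom described above (one byte on disk where UTF-8 needs two) can be reproduced with the JDK alone. This is only a minimal sketch: the class name AccentBytes and the helper hex are made up for illustration, and ISO-8859-1 stands in for whatever the platform default encoding happened to be on the poster's machine.

```java
import java.nio.charset.StandardCharsets;

final class AccentBytes {
    // Formats bytes as space-separated upper-case hex, e.g. "C3 A9".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // e with acute accent, written as an escape
        // A single-byte Latin encoding stores the character as one byte.
        System.out.println("ISO-8859-1: " + hex(eAcute.getBytes(StandardCharsets.ISO_8859_1)));
        // UTF-8 stores the same character as a two-byte sequence.
        System.out.println("UTF-8:      " + hex(eAcute.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The first line prints E9 and the second prints C3 A9, which matches the single byte seen in the editor versus the two bytes a UTF-8 file should contain.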


Jim Cobban wrote:
OutputFormat format = new OutputFormat(document);

Does it help if you explicitly set the encoding, as in
new OutputFormat(document, "UTF-8", true);
?


--

Martin Honnen
http://JavaScript.FAQTs.com/

Jul 20 '05 #2

"Martin Honnen" <ma*******@yahoo.de> wrote in message
news:3f********@olaf.komtel.net...


Does it help if you explicitly set
new OutputFormat(document, "UTF-8", true);
??

Martin Honnen
http://JavaScript.FAQTs.com/

No. Explicitly setting the encoding in the OutputFormat does not change the
behavior. The non-basic character is still written to the output as a single
byte rather than as the 2-byte UTF-8 sequence it should be.
Jul 20 '05 #3
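
One way to picture this failure mode, using only the JDK: text that declares itself as UTF-8 is pushed through a Writer that encodes with a different charset, and the parser then rejects the bytes. This is a sketch under assumptions, not a reproduction of the Xerces serializer. The class DeclaredVersusActual and its methods are hypothetical names, and ISO-8859-1 again stands in for the platform default encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

final class DeclaredVersusActual {
    // The declaration claims UTF-8; the actual bytes depend on the Writer.
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?><desc>\u00e9</desc>";

    // Encodes the XML text through a Writer bound to the given charset.
    static byte[] writeWith(Charset charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(XML);
        }
        return out.toByteArray();
    }

    // Returns true if the JDK's XML parser accepts the bytes.
    static boolean parses(byte[] xmlBytes) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));
            return true;
        } catch (Exception e) {
            // Malformed byte sequences surface as a parse or I/O exception.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("written as UTF-8:      parses = "
            + parses(writeWith(StandardCharsets.UTF_8)));
        System.out.println("written as ISO-8859-1: parses = "
            + parses(writeWith(StandardCharsets.ISO_8859_1)));
    }
}
```

The point of the sketch is that the encoding name in the XML declaration is only a label. Once a Writer has been constructed, its charset decides the bytes, and nothing set afterwards on a format object can change that.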

"Jim Cobban" <jc*****@magma.ca> wrote in message
news:JY********************@magma.ca...
I must be missing something.
I was misunderstanding something:
    XMLSerializer serializer;
    serializer = new XMLSerializer(
        new FileWriter(fileName),
        format);


When I replaced this with:

    serializer = new XMLSerializer(
        new FileOutputStream(fileName),
        format);

it worked correctly. The problem was that a FileWriter is constructed with
the platform default encoding, so passing one in left no opportunity to
specify UTF-8. The second form lets the new XMLSerializer supply the correct
encoding when it constructs its OutputStreamWriter under the covers.

Basically, my mistake was copying sample code from the distribution without
taking the time to understand exactly what it was doing. Once I took that
time, I realized that I was using the wrong constructor.
Jul 20 '05 #4
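
The same idea (hand the serializer a byte stream and let it apply the encoding) can be sketched with the JDK's own javax.xml.transform API. Note that this swaps in a different serializer from the org.apache.xml.serialize.XMLSerializer used in the thread, so it illustrates the principle rather than the exact fix. The class Utf8DomSave and its methods save and roundTrip are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class Utf8DomSave {
    // Serializes the DOM to a byte stream; the transformer, not a
    // pre-built Writer, decides how characters become bytes.
    static void save(Document document, OutputStream out) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.transform(new DOMSource(document), new StreamResult(out));
    }

    // Builds a one-element document, saves it, re-parses the bytes,
    // and returns the text that comes back.
    static String roundTrip(String text) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.newDocument();
        Element desc = doc.createElement("desc");
        desc.appendChild(doc.createTextNode(text));
        doc.appendChild(desc);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        save(doc, out);

        Document reread = builder.parse(new ByteArrayInputStream(out.toByteArray()));
        return reread.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String original = "\u00e9"; // e with acute accent
        System.out.println("round trip preserved text: "
            + original.equals(roundTrip(original)));
    }
}
```

To write to a file, a FileOutputStream can be passed to save in place of the ByteArrayOutputStream. If a Writer is unavoidable, wrapping the stream as new OutputStreamWriter(stream, StandardCharsets.UTF_8) at least makes the charset explicit instead of relying on the platform default.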
