Unicode problem with Java Xerces DOM

I'm having trouble with Unicode encoding in DOM. As a simple example,
I read in a UTF-8 encoded XML file such as:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

<aText>letter 'a' with umlaut: ä</aText>

And when I serialize it, it comes out encoded as ISO-8859-1. But I
don't think the problem is with serialization. In processing my XML
files, I'm matching bits and pieces of text and attributes with some
Unicode/UTF-8 text read in from another source. When the strings in my
XML file contain non-ASCII characters, I have problems.

Hopefully I've explained the problem well enough that someone can help.
In case it's needed, I've attached a bit of code at the end for reading
in and serializing a DOM.

Dale Gerdemann
----------------
import org.xml.sax.InputSource;
import java.io.FileInputStream;
import java.io.File;
import java.io.FileWriter;
import org.w3c.dom.Document;
import org.apache.xerces.parsers.DOMParser;
import org.apache.xerces.dom.DOMImplementationImpl;
import org.xml.sax.SAXException;
import org.w3c.dom.DOMException;
import java.io.IOException;
import org.w3c.dom.Element;
import org.apache.xml.serialize.OutputFormat;
import org.apache.xml.serialize.XMLSerializer;
import org.apache.xml.serialize.LineSeparator;

public class AProblem {

    public static void main(String[] args)
            throws DOMException, IOException, SAXException {

        DOMParser parser = new DOMParser();
        InputSource is = new InputSource(new FileInputStream(new File("foo.xml")));
        is.setEncoding("UTF-8");
        parser.parse(is);
        Document doc = parser.getDocument();
        Element root = doc.getDocumentElement();
        System.out.println(root.getChildNodes().item(0));

        OutputFormat format = new OutputFormat(doc);
        format.setLineSeparator(LineSeparator.Unix);

        format.setIndenting(true);
        format.setLineWidth(0);
        format.setPreserveSpace(true);
        format.setEncoding("UTF-8");
        FileWriter fw = new FileWriter("bar.xml");

        XMLSerializer serializer = new XMLSerializer(fw, format);
        serializer.serialize(doc);
    }
}
Jul 20 '05 #1
Dale Gerdemann wrote:
I'm having trouble with Unicode encoding in DOM. As a simple example,
I read in a UTF-8 encoded xml file such as:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

<aText>letter 'a' with umlaut: ä</aText>

And when I serialize it, it comes out encoded as ISO-8859-1. But I
don't think the problem is with serialization. In processing my XML
files, I'm matching bits and pieces of text and attributes with some
Unicode/UTF-8 text read in from another source. When the strings in my
XML file contain non-ASCII characters, I have problems.

Dale,

Which JDK are you using, and in which environment?

Regards,
Kenneth
Jul 20 '05 #2
In article <93**************************@posting.google.com >,
dg@sfs.nphil.uni-tuebingen.de (Dale Gerdemann) wrote:
:I'm having trouble with Unicode encoding in DOM. As a simple example,
:I read in a UTF-8 encoded xml file such as:
:
:<?xml version="1.0" encoding="UTF-8" standalone="no"?>
:
:<aText>letter 'a' with umlaut: ä</aText>
:
:And when I serialize it, it comes out encoded as ISO-8859-1. But I
:don't think the problem is with serialization. In processing my XML
:files, I'm matching bits and pieces of text and attributes with some
:Unicode/UTF-8 text read in from another source. When the strings in my
:XML file contain non-ASCII characters, I have problems.


I've encountered this problem myself. The solution was to use something
other than a FileWriter to output the new XML document. Two encodings
are involved: the one declared in the XML itself and the one actually
used to write the characters to the file, and the two have to agree.
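As a small illustration of why those two encodings have to agree, here is a sketch that uses only the JDK (no Xerces). The class name EncodingMismatchSketch and the helper hex are made up for this example. It prints the bytes that the character from the sample file turns into under each charset:

```java
import java.nio.charset.StandardCharsets;

public class EncodingMismatchSketch {

    // Hypothetical helper: renders bytes as upper-case hex digits.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String umlaut = "\u00e4"; // the 'a' with umlaut from the sample file

        // One character in memory, different bytes on disk per charset.
        System.out.println("UTF-8:      " + hex(umlaut.getBytes(StandardCharsets.UTF_8)));
        System.out.println("ISO-8859-1: " + hex(umlaut.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

If the file declares UTF-8 but the writer emits the single ISO-8859-1 byte, a reader that trusts the declaration will not see a valid UTF-8 sequence for that character.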

Your OutputFormat object specifies that the XML gets UTF-8 encoding, but
the FileWriter will use your system's default encoding. What I use now
is an OutputStreamWriter whose constructor takes an OutputStream (I
use a FileOutputStream) and a String naming the encoding. That solved
the problem for me.
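A minimal sketch of that replacement, assuming the rest of the program stays as posted. The class name Utf8WriterSketch and the helper openWriter are hypothetical, and the Xerces serializer call is only mentioned in a comment so that the example runs on a plain JDK:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.file.Files;

public class Utf8WriterSketch {

    // Hypothetical helper: opens a writer that encodes to the named
    // charset instead of the platform default that FileWriter would use.
    public static Writer openWriter(File file, String encoding) throws IOException {
        return new OutputStreamWriter(new FileOutputStream(file), encoding);
    }

    public static void main(String[] args) throws IOException {
        File out = File.createTempFile("bar", ".xml");
        out.deleteOnExit();

        try (Writer w = openWriter(out, "UTF-8")) {
            // In the posted program, this writer would be passed to
            // new XMLSerializer(w, format) in place of the FileWriter.
            w.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
            w.write("<aText>letter 'a' with umlaut: \u00e4</aText>\n");
        }

        byte[] bytes = Files.readAllBytes(out.toPath());
        System.out.println(bytes.length + " bytes written to " + out);
    }
}
```

The encoding name given to the writer should be the same one passed to format.setEncoding, so the declaration in the output and the bytes on disk match.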

I also note that you're specifying UTF-8 on input. While I doubt it
does any harm, it shouldn't be necessary: when the parser is given a
byte stream, it can pick up the encoding from the XML declaration.
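Here is a rough sketch of parsing without any setEncoding call. Note that it swaps in the JDK's standard JAXP DocumentBuilder rather than org.apache.xerces.parsers.DOMParser, purely so it runs without extra jars. The class name DeclaredEncodingSketch and the method parseText are invented for the example:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DeclaredEncodingSketch {

    // Parses raw XML bytes and returns the root element's text content.
    public static String parseText(byte[] xmlBytes) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // No setEncoding here: the parser gets a byte stream and reads
        // the encoding from the XML declaration.
        InputSource is = new InputSource(new ByteArrayInputStream(xmlBytes));
        Document doc = builder.parse(is);
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<aText>letter 'a' with umlaut: \u00e4</aText>";
        String text = parseText(xml.getBytes(StandardCharsets.UTF_8));
        // Once parsed, the text is an ordinary Java String and can be
        // compared directly with strings from other sources.
        System.out.println(text.endsWith("\u00e4"));
    }
}
```

The comparison only works if the other source was also decoded with its correct charset when it was read in, which is worth checking if the matching still fails.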

Hope this helps.

= Steve =
--
Steve W. Jackson
Montgomery, Alabama
Jul 20 '05 #3
