Bytes IT Community

Unicode problem with Java Xerces DOM

I'm having trouble with Unicode encoding in DOM. As a simple example,
I read in a UTF-8 encoded xml file such as:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

<aText>letter 'a' with umlaut: ä</aText>

And when I serialize it, it comes out encoded as ISO-8859-1. But I
don't think the problem is with serialization. In processing my XML
files, I'm matching bits and pieces of text and attributes with some
Unicode/UTF-8 text read in from another source. When the strings in my
XML file contain non-ASCII characters, I have problems.

Hopefully, I've explained the problem well enough that someone can help.
In case it's necessary, I've attached a bit of code at the end for
reading in and serializing a DOM.

Dale Gerdemann
----------------
import org.xml.sax.InputSource;
import java.io.FileInputStream;
import java.io.File;
import java.io.FileWriter;
import org.w3c.dom.Document;
import org.apache.xerces.parsers.DOMParser;
import org.apache.xerces.dom.DOMImplementationImpl;
import org.xml.sax.SAXException;
import org.w3c.dom.DOMException;
import java.io.IOException;
import org.w3c.dom.Element;
import org.apache.xml.serialize.OutputFormat;
import org.apache.xml.serialize.XMLSerializer;
import org.apache.xml.serialize.LineSeparator;
public class AProblem {

    public static void main(String[] args)
            throws DOMException, IOException, SAXException {

        DOMParser parser = new DOMParser();
        InputSource is = new InputSource(new FileInputStream(new File("foo.xml")));
        is.setEncoding("UTF-8");
        parser.parse(is);
        Document doc = parser.getDocument();
        Element root = doc.getDocumentElement();
        System.out.println(root.getChildNodes().item(0));

        OutputFormat format = new OutputFormat(doc);
        format.setLineSeparator(LineSeparator.Unix);

        format.setIndenting(true);
        format.setLineWidth(0);
        format.setPreserveSpace(true);
        format.setEncoding("UTF-8");
        FileWriter fw = new FileWriter("bar.xml");

        XMLSerializer serializer = new XMLSerializer(fw, format);
        serializer.serialize(doc);
    }
}
Jul 20 '05 #1
2 Replies


Dale Gerdemann wrote:
:I'm having trouble with Unicode encoding in DOM. As a simple example,
:I read in a UTF-8 encoded xml file such as:
:
:<?xml version="1.0" encoding="UTF-8" standalone="no"?>
:
:<aText>letter 'a' with umlaut: ä</aText>
:
:And when I serialize it, it comes out encoded as ISO-8859-1. But I
:don't think the problem is with serialization. In processing my XML
:files, I'm matching bits and pieces of text and attributes with some
:Unicode/UTF-8 text read in from another source. When the strings in my
:XML file contain non-ASCII characters, then I have problems.

Dale,

What JDK are you using and under which env?

Regards,
Kenneth
Jul 20 '05 #2

In article <93**************************@posting.google.com >,
dg@sfs.nphil.uni-tuebingen.de (Dale Gerdemann) wrote:
:I'm having trouble with Unicode encoding in DOM. As a simple example,
:I read in a UTF-8 encoded xml file such as:
:
:<?xml version="1.0" encoding="UTF-8" standalone="no"?>
:
:<aText>letter 'a' with umlaut: ä</aText>
:
:And when I serialize it, it comes out encoded as ISO-8859-1. But I
:don't think the problem is with serialization. In processing my XML
:files, I'm matching bits and pieces of text and attributes with some
:Unicode/UTF-8 text read in from another source. When the strings in my
:XML file contain non-ASCII characters, then I have problems.
:
:Hopefully, I've explained the problem enough so that someone can help.
:In case it's necessary, I attach at the end, a bit of code for reading
:in and serializing a DOM.
:
:Dale Gerdemann
:----------------
:import org.xml.sax.InputSource;
:import java.io.FileInputStream;
:import java.io.File;
:import java.io.FileWriter;
:import org.w3c.dom.Document;
:import org.apache.xerces.parsers.DOMParser;
:import org.apache.xerces.dom.DOMImplementationImpl;
:import org.xml.sax.SAXException;
:import org.w3c.dom.DOMException;
:import java.io.IOException;
:import org.w3c.dom.Element;
:import org.apache.xml.serialize.OutputFormat;
:import org.apache.xml.serialize.XMLSerializer;
:import org.apache.xml.serialize.LineSeparator;
:
:
:public class AProblem {
:
: public static void main(String[] args)
: throws DOMException, IOException, SAXException {
:
: DOMParser parser = new DOMParser();
: InputSource is = new InputSource(new FileInputStream(new
:File("foo.xml")));
: is.setEncoding("UTF-8");
: parser.parse(is);
: Document doc = parser.getDocument();
: Element root = doc.getDocumentElement();
: System.out.println(root.getChildNodes().item(0));
:
:
:
: OutputFormat format = new OutputFormat(doc);
: format.setLineSeparator(LineSeparator.Unix);
:
: format.setIndenting(true);
: format.setLineWidth(0);
: format.setPreserveSpace(true);
: format.setEncoding("UTF-8");
: FileWriter fw = new FileWriter("bar.xml");
:
: XMLSerializer serializer = new XMLSerializer(fw, format);
: serializer.serialize(doc);
:
:
: }
:}


I've encountered this problem myself. The solution was to use something
besides a FileWriter to output your new XML document, since you need to
encode both the XML data and the data written to an external file.

Your OutputFormat object specifies that the XML gets UTF-8 encoding, but
the FileWriter will use your system's default encoding. What I use now
is an OutputStreamWriter with its constructor taking an OutputStream (I
use a FileOutputStream) and a String naming the encoding. That solved
the problem for me.
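
A minimal sketch of that swap, using only JDK classes so it runs without Xerces on the classpath. The class and helper names (Utf8WriterSketch, openUtf8Writer, writeAndReadBack) are made up for illustration, and it passes a Charset constant rather than the encoding-name String described above; OutputStreamWriter has a constructor for each form.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class Utf8WriterSketch {

    // Hypothetical helper: opens a Writer that always encodes as UTF-8,
    // regardless of the platform's default encoding.
    static Writer openUtf8Writer(String path) throws IOException {
        return new OutputStreamWriter(new FileOutputStream(path), StandardCharsets.UTF_8);
    }

    // Writes the text through the UTF-8 writer and returns the raw bytes
    // that ended up in the file, so the encoding can be inspected.
    static byte[] writeAndReadBack(String path, String text) throws IOException {
        try (Writer w = openUtf8Writer(path)) {
            w.write(text);
        }
        return Files.readAllBytes(Paths.get(path));
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("bar", ".xml");
        // \u00e4 is the a-umlaut from the example document.
        byte[] bytes = writeAndReadBack(tmp.toString(), "<aText>\u00e4</aText>");
        // In UTF-8 the a-umlaut occupies two bytes: 0xC3 0xA4.
        System.out.printf("a-umlaut bytes: %02X %02X%n", bytes[7], bytes[8]);
        Files.delete(tmp);
    }
}
```

In the original program, the Writer returned by openUtf8Writer("bar.xml") would take the place of the FileWriter passed to the XMLSerializer constructor, so the bytes on disk match the encoding set on the OutputFormat.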

I also note that you're specifying UTF-8 on input. While I doubt it
does any harm, it shouldn't be necessary.
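
To illustrate why the explicit input encoding should be unnecessary: an XML parser handed a raw byte stream works out the encoding from the XML declaration itself. The sketch below is an assumption-laden stand-in, not the poster's code. It uses the JDK's built-in JAXP DocumentBuilder instead of the Xerces DOMParser class, and the class and method names (DeclaredEncodingSketch, parseText) are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class DeclaredEncodingSketch {

    // Builds a small document that declares the given encoding, encodes it
    // to bytes in that same charset, and parses the raw bytes. No encoding
    // is set on the input; the parser relies on the XML declaration.
    static String parseText(String charsetName) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"" + charsetName + "\"?>"
                + "<aText>\u00e4</aText>";
        byte[] bytes = xml.getBytes(Charset.forName(charsetName));
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Both byte encodings should decode to the same one-character string.
        System.out.println("\u00e4".equals(parseText("UTF-8")));
        System.out.println("\u00e4".equals(parseText("ISO-8859-1")));
    }
}
```

Once parsed, the DOM holds ordinary Java strings, so comparisons against text from another source depend on that other source having been decoded with the right charset too.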

Hope this helps.

= Steve =
--
Steve W. Jackson
Montgomery, Alabama
Jul 20 '05 #3
