Incorrect parsing of special characters

Dario Di Bella

Hi all,
I hope someone can help me with this. I need to parse the following XML:

....
<area name="promotore">
<item id="004" code="003" description="attivita promotore">
<![CDATA[» Attività Promotore]]>
</item>
</area>
....

As you can see I used the CDATA section to include special characters.
Unfortunately, when I parse the file, the "item" element content turns
out to be:

Â» AttivitÃ  Promotore

i.e. the "Â" character is inserted at the beginning of the string and
the "à" character is translated into "Ã ".

I'm using the javax.xml.parsers.DocumentBuilder parser.

Has anyone got any clue? Thanks.

Dario

Jul 20 '05 #1

Bjoern Hoehrmann

* Dario Di Bella wrote in comp.text.xml:

As you can see I used the CDATA section to include special characters.
Unfortunately, when I parse the file, the "item" element content turns
out to be:

Â» AttivitÃ  Promotore

i.e. the "Â" character is inserted at the beginning of the string and
the "à" character is translated into "Ã ".

I'm using the javax.xml.parsers.DocumentBuilder parser.

Has anyone got any clue? Thanks.

The output appears to be UTF-8 that you are viewing in an application
which assumes it is encoded as ISO-8859-1 or something similar. Could
you elaborate on the problem you are trying to solve? Everything
should be fine as long as the second application supports UTF-8 and
knows that the data is UTF-8 encoded.
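
The effect can be reproduced in a few lines of Java. This is only a
sketch: the class and method names (MojibakeDemo, misdecode) are made
up for illustration, and it assumes the text really was encoded as
UTF-8 and then decoded as ISO-8859-1, which is what the symptoms
suggest but has not been confirmed.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode the text as UTF-8, then (wrongly) decode the bytes as ISO-8859-1.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "» Attività Promotore", written with escapes so that the
        // encoding of this source file does not matter.
        String original = "\u00bb Attivit\u00e0 Promotore";
        String garbled = misdecode(original);

        // Print code points instead of glyphs, because the console
        // applies an encoding of its own.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

Each of the two non-ASCII characters takes two bytes in UTF-8, so each
one comes back as two characters: U+00C2 U+00BB for "»" and
U+00C3 U+00A0 (a no-break space) for "à".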

Jul 20 '05 #2

Dario Di Bella

Bjoern,
Thanks for your reply.
I am building a JSP tag library that generates a dynamic JavaScript
menu on a web page. The JavaScript code is mm_menu.js, which ships
with Dreamweaver. The menu items should be loaded dynamically after
the user logs in, based on the user's permissions (i.e. some menu
items will be enabled, others won't). The common menu configuration is
stored in an XML file. Each <item> tag represents a menu item. The
CDATA section is the text that will be displayed on the page, so it
contains HTML-specific codes ("&nbsp;") and some special characters
commonly used in Italian ("à").

Basically what I want to do is to print that CDATA section in an HTML
page, using a jsp custom tag.

I don't understand your point about a second application: even if I
parse the XML and echo the node's content to standard output
(i.e. System.out.println(element.getData());) I get the same wrong
text.

Any suggestion?
Thanks and regards.

Dario.

Jul 20 '05 #3

Michael Borgwardt

Dario Di Bella wrote:

As you can see I used the CDATA section to include special characters.
Unfortunately, when I parse the file, the "item" element content turns
out to be:

Â» AttivitÃ  Promotore

i.e. the "Â" character is inserted at the beginning of the string and
the "à" character is translated into "Ã ".

Does your document correctly declare its encoding? If you specify
none, the default is UTF-8 whereas Windows text editors usually
default to CP1252. Trying to parse CP1252-encoded text as UTF-8
could easily lead to the weirdness you describe.
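
As a rough illustration of a correctly declared encoding, the sketch
below parses an ISO-8859-1 byte stream whose XML declaration says so.
The class and method names (DeclaredEncodingDemo, parseItemText) and
the one-element document are invented for the example; only the
javax.xml.parsers and org.w3c.dom calls are real API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class DeclaredEncodingDemo {
    // Parse raw bytes and return the text of the root element.
    // Handing the parser a byte stream (not a Reader) lets it pick the
    // decoder from the encoding declaration in the document itself.
    static String parseItemText(byte[] xmlBytes) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }

    public static void main(String[] args) {
        // The declaration names the encoding the bytes are really in.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<item><![CDATA[\u00bb Attivit\u00e0 Promotore]]></item>";
        byte[] latin1 = xml.getBytes(StandardCharsets.ISO_8859_1);

        String text = parseItemText(latin1);
        // Compare in memory instead of printing glyphs to the console.
        System.out.println(text.equals("\u00bb Attivit\u00e0 Promotore"));
    }
}
```

The comparison is done on the String itself because printing the text
would add the console's encoding to the picture.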

Jul 20 '05 #4

Thomas Weidenfeller

Dario Di Bella wrote:

<![CDATA[» Attività Promotore]]>
Â» AttivitÃ  Promotore

i.e. the "Â" character is inserted at the beginning of the string and
the "à" character is translated into "Ã ".

Check your charset encoding. This looks very much as if the encoding in
which the XML comes and the encoding used to read it don't match.

/Thomas

Jul 20 '05 #5

Bjoern Hoehrmann

* Dario Di Bella wrote in comp.text.xml:

Basically what I want to do is to print that CDATA section in an HTML
page, using a jsp custom tag.

To make that work you need to ensure that the HTML document and the
output of your code use the same encoding. It's all about the bytes.
You could try copying and pasting the following fragment into an HTML
document

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<meta http-equiv=Content-Type content="text/html;charset=utf-8">
<title></title>
<p>Â» AttivitÃ* Promotore</p>

and load that into your browser. All your characters should be just
fine. If you change the fragment to

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<meta http-equiv=Content-Type content="text/html;charset=iso-8859-1">
<title></title>
<p>Â» AttivitÃ* Promotore</p>

It breaks. So you need to change the encoding of the surrounding HTML
document and/or its encoding declaration, transcode the data, or use
character references.
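
The character-reference option could look roughly like this. The
helper name toCharRefs is hypothetical, and the sketch deliberately
leaves "<", ">" and "&" alone on the assumption that the CDATA text is
meant to carry HTML markup.

```java
class CharRefs {
    // Replace every non-ASCII code point with a decimal numeric
    // character reference, so the result is pure ASCII and survives
    // any ASCII-compatible page encoding.
    static String toCharRefs(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp < 0x80) {
                sb.append((char) cp);
            } else {
                sb.append("&#").append(cp).append(';');
            }
            // Advance by one or two chars, depending on the code point.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "» Attività Promotore" written with escapes.
        System.out.println(toCharRefs("\u00bb Attivit\u00e0 Promotore"));
        // prints: &#187; Attivit&#224; Promotore
    }
}
```

Since the output contains only ASCII characters, it displays the same
whether the page is declared as utf-8 or iso-8859-1.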

The java.lang.String class, for example, provides a getBytes(...)
method. You can do e.g. the following:

class Foo {
    public static void main(String[] argv) {
        try {
            // "ö" (U+00F6) encoded as UTF-8: bytes 0xC3 0xB6
            System.out.write("\u00f6".getBytes("UTF-8"));
            System.out.println();
            // "ö" encoded as ISO-8859-1: single byte 0xF6
            System.out.write("\u00f6".getBytes("ISO-8859-1"));
            System.out.println();
            System.out.write(0x94); /* CP850, but it is not supported... */
            System.out.println();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Depending on your operating system, locale, etc., one of the writes
will most likely show an "ö". On the Windows command line it is most
likely the last write(). If you redirect the output to a file
(`java Foo > file.txt`) and open file.txt in Notepad, you will notice
that it is now the second write() that shows the "ö". You can also
create a new text file containing "ö", go to the command prompt and
type "type C:\...\file.txt", which will then likely show "÷" instead
of "ö".

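One way to stop depending on the platform default is to choose the
output encoding explicitly. A minimal sketch, assuming the consumer of
the output expects UTF-8; the class and method names
(ExplicitOutputEncoding, writerFor) are invented for the example:

```java
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitOutputEncoding {
    // Wrap a byte stream in a writer that encodes characters with the
    // given charset instead of the platform default.
    static PrintWriter writerFor(OutputStream out, Charset charset) {
        return new PrintWriter(new OutputStreamWriter(out, charset), true);
    }

    public static void main(String[] args) {
        PrintWriter utf8Out = writerFor(System.out, StandardCharsets.UTF_8);
        // Always emits the UTF-8 bytes for "ö"; whether the console
        // shows them correctly still depends on the console.
        utf8Out.println("\u00f6");
        utf8Out.flush();
    }
}
```

The same wrapper works for a file or a servlet output stream, as long
as the declared encoding of the destination matches the charset passed
in.
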
HTH...

Jul 20 '05 #6

Dario Di Bella

Bjoern/Michael/Thomas,

I solved this issue by declaring a different encoding ("iso-8859-1"
instead of "utf-8"). Thank you very much for your help, and excuse me
for bothering you with a trivial problem ;-)

Best regards.

Dario.

Jul 20 '05 #7
