Incorrect parsing of special characters

Hi all,
I hope someone can help me with this. I need to parse the following XML:

....
<area name="promotore">
<item id="004" code="003" description="attivita promotore">
<![CDATA[»&nbsp;Attività&nbsp;Promotore]]>
</item>
</area>
....

As you can see I used the CDATA section to include special characters.
Unfortunately as I parse the file, the "item" element content turns to
be:

Â»&nbsp;AttivitÃ &nbsp;Promotore

i.e. the "Â" character is inserted at the beginning of the string and
the "à" character is translated into "Ã ".

I'm using the javax.xml.parsers.DocumentBuilder parser.

Has anyone got any clue? Thanks.

Dario
Jul 20 '05 #1
* Dario Di Bella wrote in comp.text.xml:
As you can see I used the CDATA section to include special characters.
Unfortunately as I parse the file, the "item" element content turns to
be:

Â»&nbsp;AttivitÃ &nbsp;Promotore

i.e. the "Â" character is inserted at the beginning of the string and
the "à" character is translated into "Ã ".

I'm using the javax.xml.parsers.DocumentBuilder parser.

Has anyone got any clue? Thanks.


The output seems to be UTF-8 that you are viewing in an application
which assumes it is encoded as ISO-8859-1 or something similar. Could
you elaborate on what kind of problem you are trying to solve?
Everything should be fine as long as the second application supports
UTF-8 and knows that the data is UTF-8 encoded.
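
To make the mismatch concrete, here is a minimal sketch, assuming the text really is encoded as UTF-8 and the viewer decodes it as ISO-8859-1. The class and method names are made up for illustration, and the sample string is only a shortened stand-in for the menu text.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode the text as UTF-8, then decode those same bytes as if they
    // were ISO-8859-1. Every non-ASCII character turns into two characters.
    static String utf8ReadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // \u00bb is the right-pointing double angle quote,
        // \u00e0 is the letter a with a grave accent.
        String original = "\u00bb Attivit\u00e0";
        // What you actually see also depends on the console's own encoding.
        System.out.println(utf8ReadAsLatin1(original));
    }
}
```

U+00BB becomes the two bytes 0xC2 0xBB in UTF-8, which ISO-8859-1 reads as U+00C2 followed by U+00BB, so an extra character appears in front. U+00E0 becomes 0xC3 0xA0, which reads as U+00C3 followed by a no-break space.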
Jul 20 '05 #2
Bjoern,
Thanks for your reply.
I am building a JSP tag library to build a dynamic JavaScript menu on
a web page. The JavaScript code is mm_menu.js, which ships with
Dreamweaver. The menu items should be loaded dynamically after the
user logs in, based on the user's permissions (i.e. some menu items
will be enabled, others won't). The common menu configuration is stored
in an XML file. Each <item> tag represents a menu item. The CDATA
section is the text that will be displayed on the page, so it contains
HTML-specific codes ("&nbsp;") and some special characters currently
used in the Italian language ("à").

Basically what I want to do is to print that CDATA section in an HTML
page, using a jsp custom tag.

I don't understand your remark about a second application:
even if I parse the XML and echo the node contents to standard
output (i.e. System.out.println(element.getData());) I get the same
wrong text.

Any suggestion?
Thanks and regards.

Dario.

Jul 20 '05 #3
Dario Di Bella wrote:
As you can see I used the CDATA section to include special characters.
Unfortunately as I parse the file, the "item" element content turns to
be:

Â»&nbsp;AttivitÃ &nbsp;Promotore

i.e. the "Â" character is inserted at the beginning of the string and
the "à" character is translated into "Ã ".


Does your document correctly declare its encoding? If you specify
none, the default is UTF-8 whereas Windows text editors usually
default to CP1252. Trying to parse CP1252-encoded text as UTF-8
could easily lead to the weirdness you describe.
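
As a rough illustration of declaring the encoding, the sketch below parses a small document with javax.xml.parsers.DocumentBuilder. The one-element document and the class and method names are assumptions made up for the example; the point is only that the bytes handed to the parser must be in the encoding named in the XML declaration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class DeclaredEncodingDemo {
    // Parse raw XML bytes and return the text content of the root element.
    // The parser picks the encoding from the XML declaration in the bytes.
    static String rootText(byte[] xmlBytes) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical document; \u00e0 is the letter a with a grave accent.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                   + "<item><![CDATA[Attivit\u00e0]]></item>";
        // Simulate a file saved as ISO-8859-1, matching the declaration.
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        // Compare against the expected characters rather than printing them,
        // so the check does not depend on the console encoding.
        System.out.println(rootText(bytes).equals("Attivit\u00e0"));
    }
}
```

If the declaration is left out, the parser assumes UTF-8, so a file saved in a single-byte encoding no longer matches what the parser expects.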
Jul 20 '05 #4
Dario Di Bella wrote:
<![CDATA[»&nbsp;Attività&nbsp;Promotore]]>
Â»&nbsp;AttivitÃ &nbsp;Promotore

i.e. the "Â" character is inserted at the beginning of the string and
the "à" character is translated into "Ã ".


Check your charset encoding. This looks very much as if the encoding in
which the XML comes and the encoding used to read it don't match.

/Thomas
Jul 20 '05 #5
* Dario Di Bella wrote in comp.text.xml:
Basically what I want to do is to print that CDATA section in an HTML
page, using a jsp custom tag.


To make that work you need to ensure that the HTML document and the
output of your code use the same encoding. It's all about the bytes.
You could try to copy and paste the following fragment into an HTML
document

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<meta http-equiv=Content-Type content="text/html;charset=utf-8">
<title></title>
<p>Â»&nbsp;AttivitÃ &nbsp;Promotore</p>

and load that into your browser. All your characters should be just
fine. If you change the fragment to

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<meta http-equiv=Content-Type content="text/html;charset=iso-8859-1">
<title></title>
<p>Â»&nbsp;AttivitÃ &nbsp;Promotore</p>

it breaks. So you need to change the encoding of the surrounding HTML
document (or its encoding declaration), transcode the data, or use
character references.
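
The character-reference option could be sketched roughly as follows. This is only an illustration: the class and method names are hypothetical, and it assumes the input is already HTML markup (as the CDATA text is), so existing entities such as &nbsp; are passed through untouched.

```java
class CharRefDemo {
    // Replace every non-ASCII code point with a decimal numeric character
    // reference, so the output is pure ASCII and displays the same way no
    // matter which encoding the surrounding HTML page declares.
    static String toCharacterReferences(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp);                  // plain ASCII, keep as is
            } else {
                out.append("&#").append(cp).append(';'); // e.g. U+00E0 -> &#224;
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00bb and \u00e0 are the two non-ASCII characters in the menu text.
        String label = "\u00bb&nbsp;Attivit\u00e0&nbsp;Promotore";
        System.out.println(toCharacterReferences(label));
    }
}
```

Because the result contains only ASCII, it survives both the utf-8 and the iso-8859-1 variant of the page unchanged.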

The java.lang.String object, for example, provides a getBytes(...)
method, so you can do e.g. the following:

class Foo {
    public static void main(String[] argv) {
        try {
            // "\u00f6" is o with diaeresis; write its raw bytes in
            // different encodings straight to standard output.
            System.out.write("\u00f6".getBytes("UTF-8"));      // 0xC3 0xB6
            System.out.println();
            System.out.write("\u00f6".getBytes("ISO-8859-1")); // 0xF6
            System.out.println();
            System.out.write(0x94); /* CP850, but it is not supported... */
            System.out.println();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Depending on your operating system, locale, etc., one of the writes
will most likely show an "ö". If you work on Windows, on the command
line it is most likely the last write(). If you redirect the output to
a file (`java Foo > file.txt`) and open file.txt in Notepad, you will
notice that it is now the second write() that shows the "ö". You can
also create a new text file containing "ö", go to the command line
prompt and type "type C:\...\file.txt", which would then likely show
"÷", not "ö".

HTH...
Jul 20 '05 #6
Bjoern/Michael/Thomas,

I solved this issue by declaring a different encoding ("iso-8859-1"
instead of "utf-8"). Thank you very much for your help, and excuse me
for bothering you with a trivial problem ;-)

Best regards.

Dario.
Jul 20 '05 #7
