Foreign Characters in XML

Hugh Janus

Hi all,

I posted a couple of weeks ago with what I thought was a problem with
the file system reading accented characters. However, after debugging
line by line, I have now found the true problem.

I am storing a list of files in an XML file as a sort of database.
Some of these filenames have accented characters (e.g. á é í ó ú
or ñ). However, upon writing the filename to the XML file, the
accented character is dropped. This causes a problem upon re-reading
the filenames: the program cannot find the files because their
'saved' filename is now different. For example, the word "más" is
saved in the XML file as "ms".

Any ideas how I can work around this? I could strip out the accents
and replace them with their "normal" equivalent, e.g. á becomes a. But
this is a sort of bodge fix, as I will lose the link to the original
file. Also, I can see a scenario where a file may get overwritten
because the modified filename is the same as that of an existing file.

So, to put it bluntly, I'm stuck! Help!
Thanks

Jan 9 '06 #1

Herfried K. Wagner [MVP]

"Hugh Janus" <my*************@hotmail.com> schrieb:

> I am storing a list of files in an XML file as a sort of database.
> Some of these filenames have accented characters (i.e. á é í ó ú
> or ñ). However, upon writing the filename to the XML file, the
> accented character is dropped. This causes a problem upon re-reading
> the filenames because the program can not find the files because their
> 'saved' filename is now different. For example, the word "más" is
> saved in the XML file as "ms".

How are you currently writing data to the XML file? Which classes are you
using? It's likely that the problem is caused by a wrong encoding used to
persist the data.

--
M S Herfried K. Wagner
M V P <URL:http://dotnet.mvps.org/>
V B <URL:http://classicvb.org/petition/>

Jan 9 '06 #2

Matthew.Gertz

Hi, Hugh,
I'm not sure what your code looks like, but you may need to encode these characters. They should be stored and read back as either UTF-8 or UTF-16 (which .NET calls "Unicode"); XML processors are required to recognize both. This should "just work" if you are using the .NET Framework's System.Xml classes to generate or read an XML document; you shouldn't have to do anything. Are you generating your own XML instead, and parsing it on your own? If so, you will need to do the encoding yourself, and you will need to make sure the file you create has the appropriate XML declaration naming the encoding. It sounds like you're translating the characters to bare ASCII when you write them out. You can use the System.Text.UTF8Encoding class to convert between strings and UTF-8 bytes, for example.
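
As a rough sketch of the System.Xml route, the following writes one filename with XmlWriter and reads it back with XmlDocument. The file name "files.xml" and the element names "files" and "file" are made up for illustration, and it assumes VB 2005 / .NET 2.0 or later (for Using and XmlWriter.Create).

```vb
Imports System.Text
Imports System.Xml

Module XmlFileList
    Sub Main()
        ' Hypothetical names, for illustration only.
        Dim xmlPath As String = "files.xml"
        ' "más.txt", built with ChrW so the source file's own encoding does not matter.
        Dim fileName As String = "m" & ChrW(&HE1) & "s.txt"

        ' Ask the writer to produce UTF-8 explicitly.
        Dim settings As New XmlWriterSettings()
        settings.Encoding = Encoding.UTF8
        settings.Indent = True

        Using writer As XmlWriter = XmlWriter.Create(xmlPath, settings)
            writer.WriteStartElement("files")
            writer.WriteElementString("file", fileName)
            writer.WriteEndElement()
        End Using

        ' XmlDocument.Load works out the encoding from the file itself.
        Dim doc As New XmlDocument()
        doc.Load(xmlPath)
        Dim readBack As String = doc.SelectSingleNode("/files/file").InnerText

        ' True when the accented character survived the round trip.
        Console.WriteLine(readBack = fileName)
    End Sub
End Module
```

The point of the sketch is only that no manual escaping of á is needed when the XML classes do both the writing and the reading.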

Let us know how it goes. (I'll be away for a few days, but will check back when I get back Friday.)

--Matt Gertz--*
VB Compiler Dev Lead

Jan 9 '06 #3

Hugh Janus

Herfried K. Wagner [MVP] wrote:

> How are you currently writing data to the XML file? Which classes are you
> using? It's likely that the problem is caused by a wrong encoding used to
> persist the data.

I am using the StreamReader and StreamWriter classes. Can I specify the
encoding with these in order to keep the accented characters?

Jan 10 '06 #4

Carlos J. Quintero [VB MVP]

The StreamReader and StreamWriter classes have overloaded constructors to
specify the encoding:

public StreamReader(System.String path, System.Text.Encoding encoding)

public StreamWriter(System.String path, System.Boolean append,
    System.Text.Encoding encoding)

You will have to use the System.Text.Encoding.Default encoding.
--

Best regards,

Carlos J. Quintero

MZ-Tools: Productivity add-ins for Visual Studio 2005, Visual Studio .NET,
VB6, VB5 and VBA
You can code, design and document much faster in VB.NET, C#, C++ or VJ#
Free resources for add-in developers:
http://www.mztools.com

Jan 10 '06 #5

Hugh Janus

Carlos J. Quintero [VB MVP] wrote:

> The StreamReader and StreamWriter classes have overloaded constructors to
> specify the encoding:
>
> public StreamReader(System.String path, System.Text.Encoding encoding)
>
> public StreamWriter(System.String path, System.Boolean append,
>     System.Text.Encoding encoding)
>
> You will have to use the System.Text.Encoding.Default encoding.

Thanks Carlos for this. I'll give it a try and post back if it fails.

I assume that the System.Text.Encoding.Default will cater for all
accents and the Ñ ?

Jan 10 '06 #6

Carlos J. Quintero [VB MVP]

> I assume that the System.Text.Encoding.Default will cater for all accents
> and the Ñ ?

The Default encoding uses your Windows code page instead of Unicode. As long
as your Windows code page (Control Panel, Regional Settings) matches the
code page of the computer used to generate the files, it will work.

--

Best regards,

Carlos J. Quintero

Jan 10 '06 #7

Hugh Janus

Carlos J. Quintero [VB MVP] wrote:

>> I assume that the System.Text.Encoding.Default will cater for all accents
>> and the Ñ ?
>
> The Default encoding uses your Windows code page instead of Unicode. As long
> as your Windows code page (Control Panel, Regional Settings) matches the
> code page of the computer used to generate the files, it will work.

Ah, well there is a problem. I am developing on a computer that is set
to Spanish regional settings but the app very possibly could be
installed on a computer with different regional settings. Is there a
universal one I could use?

Jan 10 '06 #8

Herfried K. Wagner [MVP]

"Hugh Janus" <my*************@hotmail.com> schrieb:

> Ah, well there is a problem. I am developing on a computer that is set
> to Spanish regional settings but the app very possibly could be
> installed on a computer with different regional settings. Is there a
> universal one I could use?

I'd go with 'Encoding.UTF8' or 'Encoding.Unicode' (which is UTF-16).
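
A minimal sketch of that suggestion, using the StreamWriter and StreamReader constructors quoted earlier in the thread with Encoding.UTF8 on both sides. The file name "filelist.txt" is hypothetical, and the code assumes VB 2005 / .NET 2.0 or later.

```vb
Imports System.IO
Imports System.Text

Module Utf8RoundTrip
    Sub Main()
        ' Hypothetical list file, for illustration only.
        Dim listPath As String = "filelist.txt"
        ' "más.txt", built with ChrW so the source file's own encoding does not matter.
        Dim original As String = "m" & ChrW(&HE1) & "s.txt"

        ' False = overwrite rather than append; the encoding is stated explicitly.
        Using writer As New StreamWriter(listPath, False, Encoding.UTF8)
            writer.WriteLine(original)
        End Using

        ' Read back with the same encoding that was used for writing.
        Dim readBack As String
        Using reader As New StreamReader(listPath, Encoding.UTF8)
            readBack = reader.ReadLine()
        End Using

        ' Ordinal comparison checks the characters themselves, not culture rules.
        Console.WriteLine(String.Equals(original, readBack, StringComparison.Ordinal))
    End Sub
End Module
```

Because UTF-8 does not depend on the machine's regional settings, the same file should read back identically on a Spanish, English or Chinese installation.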

Jan 10 '06 #9

Cor Ligthert [MVP]

Hugh,

Have a look at these code page tables.

Operating systems
http://www.microsoft.com/globaldev/r...ocversion.mspx

As you can see on the last page, countries where Western European
languages are spoken use code page 1252 as standard.

I hope this helps a little bit?

Cor

Jan 10 '06 #10

Carlos J. Quintero [VB MVP]

Code pages are not per country but per larger region or alphabet.
For example, Western European languages use code page 1252 (ANSI Latin I), if
I remember correctly. So, if you are exchanging data between, say, France and
Spain, it will work. China or Russia would be a problem, though.

Also, if you know the code page that was used to create the file, you can
get the matching encoding instead of using "Default":

System.Text.Encoding.GetEncoding(codepage)

and pass it to your reader.

If you want to avoid the code page mess, then the writer and the reader
should both use Unicode, which was invented to avoid this kind of problem.
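
To make the code page point concrete, here is a small in-memory sketch: the same bytes are decoded once with the code page that produced them (1252) and once with a different one (1251, Cyrillic, chosen only as an example). It assumes the classic .NET Framework, where these code pages are available out of the box; newer runtimes may need code page encodings registered before GetEncoding will return them.

```vb
Imports System.Text

Module CodePageMismatch
    Sub Main()
        ' "más", built with ChrW so the source file's own encoding does not matter.
        Dim name As String = "m" & ChrW(&HE1) & "s"

        Dim westernEuropean As Encoding = Encoding.GetEncoding(1252)
        Dim cyrillic As Encoding = Encoding.GetEncoding(1251)

        ' One byte per character in a single-byte code page.
        Dim bytes As Byte() = westernEuropean.GetBytes(name)

        ' Same code page for writing and reading: the text survives (True).
        Console.WriteLine(westernEuropean.GetString(bytes) = name)

        ' Different code page on the reading side: the á byte is
        ' interpreted as a different character (False).
        Console.WriteLine(cyrillic.GetString(bytes) = name)
    End Sub
End Module
```

The same Encoding object can be passed to the StreamReader constructor when the file on disk is known to be in that code page.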
--

Best regards,

Carlos J. Quintero

Jan 10 '06 #11

Hugh Janus

> The code pages are not per country, but for greater regions or alphabets.
> For example, western european languages use code page 1252 (ANSI Latin I) if
> I remember correctly. So, if you are exchanging data from, say, France to
> Spain, it will work. China or Russia would be a problem, though.
>
> Also, if you know the code page that was used to create the file, you can
> get the matching encoding instead of using "Default":
>
> System.Text.Encoding.GetEncoding(codepage)
>
> and pass it to your reader.
>
> If you want to avoid the code page mess, then the writer and the reader
> should use Unicode, which was invented to avoid this kind of problems.

:-O Carlos, I am impressed! here, have another MVP!

I think my safest option is to use Unicode, as China is one of the
markets that might be targeted in the future.

This raises one other question. If Unicode was invented to avoid all
this, then what is the benefit of NOT using Unicode?

Jan 10 '06 #12

Carlos J. Quintero [VB MVP]

"Hugh Janus" <my*************@hotmail.com> escribió en el mensaje
news:11*********************@g44g2000cwa.googlegro ups.com...

> I think my safest option is to use unicode as China is one of the
> markets that might be targeted in the future.

Yes, the safest is to use Unicode.

> This raises one other question. If unicode was invented to avoid all
> this, then what is the benefit of NOT using unicode?

Unicode has the drawback that it increases the size of the file, since it uses 2
bytes per character, compared to 1 byte per character when using code pages.
It is the price to pay to accommodate all the characters of all
alphabets. So, NOT using Unicode has the benefit of smaller files.

--

Best regards,

Carlos J. Quintero

Jan 10 '06 #13

Carlos J. Quintero [VB MVP]

You may also enjoy this article:

The Absolute Minimum Every Software Developer Absolutely, Positively Must
Know About Unicode and Character Sets (No Excuses!)
http://www.joelonsoftware.com/articles/Unicode.html

--

Best regards,

Carlos J. Quintero

Jan 10 '06 #14

Hugh Janus

> The Absolute Minimum Every Software Developer Absolutely, Positively Must
> Know About Unicode and Character Sets (No Excuses!)
> http://www.joelonsoftware.com/articles/Unicode.html

Carlos, this is superb. Thanks. However, when I read the filenames in
via StreamReader, I add them to a hashtable. Some of the filenames
are getting added to the hashtable as just
"?????????????????????????????", which, when written back via
StreamWriter, become what look like Chinese characters.

Any ideas?

P.S. I have specified the same encoding for both writer and reader.

Jan 10 '06 #15

Hugh Janus

don't worry, i solved it. it was a typo.

Jan 11 '06 #16

Carlos J. Quintero [VB MVP]

One last thing: the 2 bytes per character that I mentioned applies only
when you save as UTF-16. Saving as UTF-8 takes less space for mostly Latin
text: 1 byte per ASCII character and 2 bytes for accented letters such as á or ñ.
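
For a quick check of those sizes, this sketch counts the bytes that the word "más" from the original question takes in each encoding. It uses only Encoding.GetByteCount, so nothing is written to disk and no byte order mark is included; code page 1252 is assumed to be available, as it is on the classic .NET Framework.

```vb
Imports System.Text

Module ByteCounts
    Sub Main()
        ' "más", built with ChrW so the source file's own encoding does not matter.
        Dim word As String = "m" & ChrW(&HE1) & "s"

        ' Code page 1252: one byte per character.
        Console.WriteLine(Encoding.GetEncoding(1252).GetByteCount(word)) ' 3

        ' UTF-8: one byte each for m and s, two for á.
        Console.WriteLine(Encoding.UTF8.GetByteCount(word))              ' 4

        ' UTF-16 (Encoding.Unicode): two bytes per character here.
        Console.WriteLine(Encoding.Unicode.GetByteCount(word))           ' 6
    End Sub
End Module
```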
--

Best regards,

Carlos J. Quintero

Jan 11 '06 #17
