UTF-8 encoding problem

Hi All,

I have a GUI that accepts a Unicode string and searches a given
set of XML files for that string.

I have two XML files, both saved in UTF-8 format and containing
characters from different languages.

Both files have a UTF-8 BOM, but only the first file also declares
UTF-8 in the XML declaration at the top of the file.

When I search that directory for a non-English character using a
third-party desktop search GUI, it reports that the character exists
in the first file (the one that also has the XML declaration), but not
in the second file (the one with only the BOM).

Initially I thought the problem was caused by UTF-8 supporting both
multibyte and Unicode, but I could not find much on it.

Please help.

Regards,
Shreshth

Oct 18 '06 #1
Shreshth,
> Both files have a UTF-8 BOM, but only the first file also declares
> UTF-8 in the XML declaration at the top of the file.

What does the second file have in its XML declaration (what specifically
does its declaration look like)?

Sounds like you have a bug in the application that wrote the second XML
file.

I suspect (hope) that when that application created the XML (the XmlWriter), it
encoded the characters per what the XML declaration states. I would then
expect (but not hope) that when it (the underlying text writer) wrote the file,
it "transposed" (read: mangled) the correctly encoded characters into UTF-8.
I consider this double transposition to be bad, very bad.
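
To make the "double transposition" concrete, here is a minimal Python sketch of what it could look like. It is an illustration only: the intermediate code page (cp1252) and the helper name `contains` are assumptions, and nothing in this thread confirms that this is what happened to the file in question.

```python
# Illustrative only: cp1252 as the wrongly assumed code page is a guess,
# not something established in the thread.
original = "\u00e9"                        # LATIN SMALL LETTER E WITH ACUTE
utf8_bytes = original.encode("utf-8")      # two bytes: C3 A9

# A reader that wrongly assumes cp1252 sees two characters instead of one.
misread = utf8_bytes.decode("cp1252")

# Writing the misread text back out as UTF-8 makes the damage permanent.
double_encoded = misread.encode("utf-8")   # four bytes: C3 83 C2 A9


def contains(haystack: bytes, needle: str) -> bool:
    """Hypothetical helper: search for the needle's UTF-8 bytes."""
    return needle.encode("utf-8") in haystack


print(contains(utf8_bytes, original))      # correctly encoded file
print(contains(double_encoded, original))  # mangled file
```

A search for the original character succeeds against the correctly encoded bytes and fails against the double-encoded ones, even though both byte sequences are valid UTF-8.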

--
Hope this helps
Jay B. Harlow
.NET Application Architect, Enthusiast, & Evangelist
T.S. Bradley - http://www.tsbradley.net
Oct 18 '06 #2
By the XML declaration at the beginning of the file, I mean the XML
declaration that has the "encoding" attribute (encoding = UTF-8; I do not
remember the exact format). It is the same as what MSDN shows.

Do you still mean the same thing in that case?
Actually, I am not able to completely understand what exactly you want
to say.

By the way, the XML writer here is Notepad.

Thanks for your reply.

Shreshth

Oct 18 '06 #3
Shreshth,
> By the XML declaration at the beginning of the file, I mean the XML
> declaration that has the "encoding" attribute (encoding = UTF-8; I do not
> remember the exact format). It is the same as what MSDN shows.

Yes, but what specifically does your file say? (Cut and paste the one from your
file into your response to this message.) Alternatively, email the files to me.

> By the way, the XML writer here is Notepad.

Ah! There's the rub!

What I am saying is that the "encoding" of your physical file (the one on disk)
is different from that of the logical file (the XML itself). (My example may have
been backwards, but the net effect is the same: the characters are not
encoded the way you think they are.)

It sounds like your physical file is UTF-8, while I'm concerned your logical
file is whatever, where "whatever" is the text you blindly copied from an MSDN
article.
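
One way to check the physical encoding against the logical one is to compare the byte order mark with the encoding named in the XML declaration. The Python sketch below is a rough illustration, not a full XML encoding detector: the function names are hypothetical, only the UTF-8 and UTF-16 BOMs are recognised, and a file without a BOM is simply assumed to be UTF-8.

```python
import codecs
import re

# Only UTF-8 and UTF-16 BOMs are handled; UTF-32 is deliberately left out.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Matches an XML declaration at the start of the text and captures the
# value of its encoding attribute.
DECL_RE = re.compile(
    r'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def physical_encoding(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def declared_encoding(data: bytes):
    """Return the encoding named in the XML declaration, or None."""
    name, bom_len = physical_encoding(data)
    # Assumption: without a BOM, treat the bytes as UTF-8.
    head = data[bom_len:bom_len + 400].decode(name or "utf-8", errors="replace")
    match = DECL_RE.match(head)
    return match.group(1) if match else None
```

For a file saved by Notepad as UTF-8, `physical_encoding` should report `utf-8`; if `declared_encoding` returns `None` or names something else, the physical and logical encodings disagree in the way described above.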
--
Hope this helps
Jay B. Harlow
.NET Application Architect, Enthusiast, & Evangelist
T.S. Bradley - http://www.tsbradley.net
Oct 18 '06 #4
Hi Jay,

<?xml version="1.0" encoding="UTF-8" ?>
This is the XML declaration I was speaking about.

The rest of the file is the same as a normal XML file.

I will try what you have told me in the office tomorrow, but one thing I
can tell you right now is that I have already tried the same file
(having only the BOM and not the XML declaration)
by saving it as UTF-16 LE and UTF-16 BE.

My third-party desktop search works with both of them.

The only problem is with the UTF-8 format.
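
One behaviour that would be consistent with these results, and this is purely a guess about the search tool rather than anything confirmed, is a detector that honours a UTF-16 BOM and an explicit `encoding="UTF-8"` declaration but ignores a bare UTF-8 BOM and falls back to a legacy code page. The Python sketch below models that hypothetical detection order; `naive_guess`, `found`, and the cp1252 fallback are all assumptions.

```python
import codecs


def naive_guess(data: bytes, fallback: str = "cp1252") -> str:
    """Hypothetical detection order; NOT the real tool's logic."""
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"       # this codec reads the BOM to pick the byte order
    if b'encoding="UTF-8"' in data[:200]:
        return "utf-8-sig"    # decodes UTF-8 and drops a leading BOM
    return fallback           # a bare UTF-8 BOM is ignored (the assumed bug)


def found(data: bytes, needle: str) -> bool:
    """Decode with the guessed encoding, then search the decoded text."""
    return needle in data.decode(naive_guess(data), errors="replace")


decl = '<?xml version="1.0" encoding="UTF-8" ?>'
body = "<r>\u0928\u092e\u0938\u094d\u0924\u0947</r>"   # sample Devanagari text

utf8_with_decl = codecs.BOM_UTF8 + (decl + body).encode("utf-8")
utf8_bom_only = codecs.BOM_UTF8 + body.encode("utf-8")
utf16_bom_only = body.encode("utf-16")                 # the codec writes a BOM

for label, data in [
    ("UTF-8 + declaration", utf8_with_decl),
    ("UTF-8, BOM only", utf8_bom_only),
    ("UTF-16, BOM only", utf16_bom_only),
]:
    print(label, found(data, "\u0928"))
```

Under that assumption only the BOM-only UTF-8 file fails the search, which matches what I am seeing. If the tool really behaves this way, adding the declaration to the second file would be the simplest workaround.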
Thanks.

Shreshth

Oct 18 '06 #5
