How to know the encoding of XML file?

Hi All,

I'm a newbie to the XML world. My problem is identifying the encoding
of an XML file at runtime. Currently I check whether a BOM is present
in the file and identify the encoding from it. The problem is that some
UTF-8 encoded files don't have a BOM at the start, so I identify such a
file as ISO-8859-1 when it is actually encoded in UTF-8.

I don't know much about encoding technology either.

Is there any way to identify the encoding of an XML file
programmatically? I can use the Xerces C++ library or any other free
library to identify the correct encoding. Any other workaround is also
welcome.

Thanks & Regards

Sep 13 '05 #1

da*********@postmark.net wrote:
I'm newbie to this XML world. My problem is to identify the encoding
type of XML at runtime. What currently I'm doing is checking whether
BOM is available in the XML; based on the BOM I'm identifying the
encoding type. Here is the problem, some type of UTF-8 encoded file
doesn't have BOM in the starting. So I'm identifying the file as
iso-8859-1 encoded which is actually encoded in UTF-8.


Well, for XML there are clear rules: if there is no XML declaration
specifying the encoding, then the document can only be UTF-8 or UTF-16
encoded, and you can decide between those two from the presence or
absence of the BOM (UTF-16 always needs one; the UTF-8 BOM is optional).
So look at the BOM and the XML declaration (that <?xml
version="version.number" encoding="encoding-is-here"?>) to find the
encoding for XML:
<http://www.w3.org/TR/REC-xml/#charencoding>
Of course, what you really detect this way is the encoding the XML
document is supposed to be in; an XML parser then has to check that the
whole document complies with that encoding. For example, if you read an
XML declaration saying encoding="ISO-8859-1", the XML is supposed to be
in that encoding, and a parser then checks whether any byte sequences
are encountered which can't be decoded properly using that encoding.
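
A minimal Python sketch of that lookup order (BOM first, then the XML declaration, then the UTF-8 default). The function name `declared_xml_encoding` and the 200-byte prefix limit are illustrative assumptions, and the sketch deliberately ignores UTF-32, EBCDIC and BOM-less UTF-16 input. It only reports what the document claims to be; validating the bytes against that claim is still the parser's job.

```python
import codecs
import re

# Matches encoding="..." inside an XML declaration at the very start of the data.
_DECL = re.compile(rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def declared_xml_encoding(data: bytes) -> str:
    """Return the encoding an XML document claims to be in (hypothetical helper)."""
    # 1. A BOM decides between UTF-8 and the two UTF-16 byte orders.
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16BE"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16LE"
    # 2. No BOM: look for an XML declaration written in ASCII-compatible bytes.
    match = _DECL.match(data[:200])
    if match:
        return match.group(1).decode("ascii")
    # 3. No BOM and no declared encoding: the document is taken to be UTF-8.
    return "UTF-8"


print(declared_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
print(declared_xml_encoding(b'<?xml version="1.0"?><a/>'))
```

With this order, a UTF-8 file without a BOM and without an encoding declaration is reported as UTF-8 rather than ISO-8859-1, which is the case the original question ran into.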

In general there needs to be a declaration of the encoding associated
with a document (e.g. in XML in the XML declaration, in HTML in a <meta>
element, or for resources accessed via HTTP in the response header), as
there is no general algorithm to detect every encoding that exists. For
instance, you cannot detect whether a document is meant to be ISO-8859-1
encoded or ISO-8859-15 encoded: the same bytes are just interpreted as
different characters, so the document author has to declare the
encoding.
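
To illustrate the point about ISO-8859-1 versus ISO-8859-15, here is a small Python sketch. The byte 0xA4 is chosen only as an example of a position where the two character sets differ; nothing in the byte itself says which reading was intended.

```python
raw = b"\xa4"  # a single byte with no declaration attached

as_latin1 = raw.decode("iso-8859-1")    # U+00A4 CURRENCY SIGN
as_latin9 = raw.decode("iso-8859-15")   # U+20AC EURO SIGN

# Both decodes succeed, so decoding alone cannot tell the two encodings apart.
print(repr(as_latin1), repr(as_latin9))
```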
--

Martin Honnen
http://JavaScript.FAQTs.com/
Sep 13 '05 #2
In <11*********************@g47g2000cwa.googlegroups.com>, on
09/13/2005
at 04:01 AM, da*********@postmark.net said:
Here is the problem, some type of UTF-8 encoded file
doesn't have BOM in the starting.


Why would any UTF-8 file have a BOM? That's for encodings with 16-bit
bytes, such as UTF-16. UTF-8 uses 8-bit bytes.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to sp******@library.lspace.org

Sep 13 '05 #3
Shmuel (Seymour J.) Metz wrote:
In <11*********************@g47g2000cwa.googlegroups.com>, on
09/13/2005
at 04:01 AM, da*********@postmark.net said:
Here is the problem, some type of UTF-8 encoded file
doesn't have BOM in the starting.


Why would any UTF-8 file have a BOM? That's for encodings with 16-bit
bytes, such as UTF-16. UTF-8 uses 8-bit bytes.


In mixed Unicode/non-Unicode environments, the BOM helps to discriminate
between Unicode/UTF-8 files and simpler ASCII/ISO-8859-x/... text files.
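
A rough Python sketch of how that discrimination might look for plain text files with no declaration. The function name `guess_text_encoding` and the ISO-8859-1 fallback are assumptions made for illustration. The strict-decode step is a heuristic and not something the XML or Unicode specifications prescribe; it can only say that the bytes are well-formed UTF-8, not that the author meant UTF-8.

```python
import codecs


def guess_text_encoding(data: bytes, fallback: str = "iso-8859-1") -> str:
    """Hypothetical heuristic for undeclared text; not an XML rule."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"       # this codec also strips the BOM when decoding
    try:
        data.decode("utf-8")     # strict by default: raises on malformed sequences
        return "utf-8"
    except UnicodeDecodeError:
        return fallback          # assumed default for legacy single-byte text


word = "caf\u00e9"
with_bom = codecs.BOM_UTF8 + word.encode("utf-8")
no_bom = word.encode("utf-8")
latin1 = word.encode("iso-8859-1")

for sample in (with_bom, no_bom, latin1):
    print(sample, guess_text_encoding(sample))
```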

--
To reply by e-mail, please remove the extra dot
in the given address: m.collado -> mcollado
Sep 13 '05 #4
On Tue, 13 Sep 2005, Shmuel (Seymour J.) Metz wrote:
Why would any UTF-8 file have a BOM?
FAQ: http://www.unicode.org/faq/utf_bom.html#28 and #29
That's for encodings with 16-bit bytes, such as UTF-16.


Except that the encoding schemes utf-16BE and utf-16LE use 16-bit code
units (I'd avoid using the term "bytes"), but don't need a BOM,
because their endian-ness is specified by the name of the encoding
scheme.

Sep 13 '05 #5
Alan J. Flavell (fl*****@ph.gla.ac.uk) wrote:
: On Tue, 13 Sep 2005, Shmuel (Seymour J.) Metz wrote:

: > Why would any UTF-8 file have a BOM?

: FAQ: http://www.unicode.org/faq/utf_bom.html#28 and #29

: > That's for encodings with 16-bit bytes, such as UTF-16.

: Except that the encoding schemes utf-16BE and utf-16LE use 16-bit code
: units (I'd avoid using the term "bytes"), but don't need a BOM,
: because their endian-ness is specified by the name of the encoding
: scheme.

utf-16BE and utf-16LE must be using 8-bit bytes, because if they were
using true 16-bit code units then there would be no endian-ness to
consider.

(I'm still waiting for hardware that increases character sizes. They've
done it for all other elementary units on the computer, integers, memory
pointers, etc, but for some reason not this one.)
--

This programmer available for rent.
Sep 13 '05 #6
In <43********@news.victoria.tc.ca>, on 09/13/2005
at 09:51 AM, yf***@vtn1.victoria.tc.ca (Malcolm Dew-Jones) said:
: > Why would any UTF-8 file have a BOM?
: FAQ: http://www.unicode.org/faq/utf_bom.html#28 and #29
Note that the file doesn't contain a BOM, but rather the UTF-8
encoding of a BOM. An actual BOM would not be valid UTF-8.
(I'm still waiting for hardware that increases character sizes.


For most hardware, character size is irrelevant. Some devices deal
with large blocks of data. Some deal with graphical data rather than
text. Some deal with individual bits. Keyboards deal with scan codes
rather than conventional character representations. The only common PC
peripherals that I can think of that actually deal with characters as
characters are a display adapter or printer in text mode, and those
are essentially obsolete.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Sep 13 '05 #7
On Tue, 13 Sep 2005, Malcolm Dew-Jones wrote:
utf-16BE and utf-16LE must be using 8 bit bytes,
That's the distinction (as set out in recent Unicode terminologies)
between the Character Encoding Form (which in all these three cases is
designated utf-16, consisting of 16-bit code units), and its Character
Encoding Schemes (of which there are the three: utf-16 with BOM,
utf-16LE, and utf-16BE) for representing the 16-bit code units as an
octet stream.

See chapter 2, sections 2.5 and 2.6, e.g.
http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf
as well as the previously-cited FAQs
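
As a small Python illustration of the three encoding schemes: the same single 16-bit code unit comes out as different octet streams. The codec names `utf-16-be`, `utf-16-le` and `utf-16` are Python's own spellings; its plain `utf-16` encoder writes a BOM in the platform's byte order, so that result is not fixed across machines.

```python
text = "A"  # one 16-bit code unit, 0x0041

be = text.encode("utf-16-be")     # high octet first, no BOM
le = text.encode("utf-16-le")     # low octet first, no BOM
with_bom = text.encode("utf-16")  # encoded BOM first, then the code unit

print(be.hex(), le.hex(), with_bom.hex())

# The named schemes need no BOM because the name fixes the byte order;
# the plain scheme is decoded by reading the BOM.
print(be.decode("utf-16-be"), le.decode("utf-16-le"), with_bom.decode("utf-16"))
```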
because if they were using true 16-bit code units then there would
be no endian-ness to consider.
It's unfortunate that when one reads "utf-16", without context, it is
unclear whether it's meant to refer to the CEF (and thus to comprise
all three CESes), or only to the one CES. Perhaps it's a pity
they didn't devise different designations for the CEF and for the CES
(maybe "utf-16BOM" for the CES).

(This isn't a problem for utf-8, since there is only one CES for
that particular CEF, with the BOM being optional.)
(I'm still waiting for hardware that increases character sizes.
Historically, there has been at least one machine with 36-bit words
that could be used as four 9-bit units; but that's past rather than
future!
They've done it for all other elementary units on the computer,
integers, memory pointers, etc, but for some reason not this one.)


I suspect you're more interested in raising it to 16 bits (or 32) than
to some non-multiple of 8, though.

best
Sep 13 '05 #8
On Tue, 13 Sep 2005, Shmuel (Seymour J.) Metz wrote:
: FAQ: http://www.unicode.org/faq/utf_bom.html#28 and #29
Note that the file doesn't contain a BOM, but rather the UTF-8
encoding of a BOM.


*No* data stream ever literally "contains" a BOM, any more than it
"contains" a copyright sign, or the letter "A" (the BOM, just like any
Unicode character, is an abstract concept): what a data stream
contains is the BOM encoded according to the appropriate "Character
Encoding Scheme". That's the whole point of the BOM, so that the
character encoding scheme can be recognised by inspecting the
encoding. So there were no surprises there.
An actual BOM would not be valid UTF-8.


An "actual BOM" is an abstract concept!

The idea of dumping the hexadecimal number x'FEFF' into a utf-8 data
stream - if that was what you had in mind - would make no sense, any
more than dumping x'00A9' into it would make any sense to represent
the copyright sign. Isn't that obvious?

Let's cut them some slack: when they say that it "contains a BOM",
they are taking it for granted that it means "appropriately encoded".
You can't put an abstract concept into a data stream *without* an
appropriate encoding, after all.
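
A short Python sketch of the same point: the abstract character U+FEFF turns into different octets depending on the encoding scheme, exactly as the copyright sign U+00A9 does. The variable names are only for illustration.

```python
bom = "\ufeff"        # the BOM as an abstract character
copyright = "\u00a9"  # the copyright sign, for comparison

bom_utf8 = bom.encode("utf-8")         # octets EF BB BF
bom_utf16be = bom.encode("utf-16-be")  # octets FE FF
bom_utf16le = bom.encode("utf-16-le")  # octets FF FE
copyright_utf8 = copyright.encode("utf-8")  # octets C2 A9, not 00 A9

print(bom_utf8.hex(), bom_utf16be.hex(), bom_utf16le.hex(), copyright_utf8.hex())

# The raw octets FE FF dropped into a UTF-8 stream are not valid UTF-8 at all.
try:
    b"\xfe\xff".decode("utf-8")
    raw_is_valid_utf8 = True
except UnicodeDecodeError:
    raw_is_valid_utf8 = False
print(raw_is_valid_utf8)
```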
Sep 13 '05 #9
