Hi All,
I'm a newbie to the XML world. My problem is identifying the encoding
of an XML file at runtime. Currently I check whether a BOM is present
in the XML and identify the encoding from it. Here is the problem:
some UTF-8 encoded files don't have a BOM at the start, so I identify
such a file as ISO-8859-1 when it is actually encoded in UTF-8.
I don't know much about encoding technology either.
Is there any way to identify the encoding of an XML file
programmatically? I can use the Xerces C++ library or any other free
library to identify the correct encoding. Any other workaround is also
welcome.
Thanks & Regards
da*********@postmark.net wrote:
> I'm a newbie to the XML world. My problem is identifying the encoding
> of an XML file at runtime. Currently I check whether a BOM is present
> in the XML and identify the encoding from it. Here is the problem:
> some UTF-8 encoded files don't have a BOM at the start, so I identify
> such a file as ISO-8859-1 when it is actually encoded in UTF-8.
Well, for XML there are clear rules: if there is no XML declaration
specifying the encoding, the document can only be UTF-8 or UTF-16
encoded, and you can tell those two apart by the presence or absence
of the BOM (UTF-16 always needs one; the UTF-8 BOM is optional).
So look at the BOM and the XML declaration (that is, <?xml
version="version.number" encoding="encoding-is-here"?>) to find the
encoding for XML:
<http://www.w3.org/TR/REC-xml/#charencoding>
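The "BOM first, then XML declaration, else UTF-8" rule can be sketched in a few lines of Python. This is only an illustration: the function name sniff_xml_encoding is made up here, and the sketch assumes an ASCII-compatible declaration, so it does not handle UTF-16 without a BOM, UTF-32, EBCDIC, or encoding information supplied by HTTP.

```python
import codecs
import re

# Encoding pseudo-attribute of an XML declaration, read as raw ASCII bytes.
_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding an XML document claims to be in (hypothetical helper)."""
    # 1. A byte order mark decides between UTF-8 and UTF-16.
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "UTF-16"
    # 2. No BOM: look for encoding="..." in the XML declaration.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii")
    # 3. No BOM and no declared encoding: XML says this must be UTF-8.
    return "UTF-8"


with_decl = sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>')
without_bom = sniff_xml_encoding(b"<?xml version='1.0'?><a/>")
```

Here with_decl comes out as "ISO-8859-1" and without_bom as "UTF-8", so a UTF-8 file without a BOM is no longer mistaken for ISO-8859-1.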
Of course, what you really detect with the above is the encoding the
XML document is supposed to be in. An XML parser then has to check
that the whole document complies with that encoding. For example, if
the XML declaration says encoding="ISO-8859-1", the XML is
supposed to be in that encoding, and a parser then checks whether it
encounters any byte sequences that can't be decoded properly using
that encoding.
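A rough Python sketch of that compliance check, assuming the declared encoding is one Python has a codec for (an unknown name raises LookupError here). The helper name complies_with is hypothetical, and a real parser checks more than this, for example which control characters XML allows.

```python
def complies_with(data: bytes, declared_encoding: str) -> bool:
    """Return True if every byte sequence in data decodes under declared_encoding."""
    try:
        data.decode(declared_encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# "é" stored as the single byte 0xE9, as ISO-8859-1 would store it.
latin1_bytes = "<a>\u00e9</a>".encode("iso-8859-1")

ok_as_latin1 = complies_with(latin1_bytes, "iso-8859-1")
ok_as_utf8 = complies_with(latin1_bytes, "utf-8")
```

The check only works in one direction: Python's ISO-8859-1 codec accepts every byte value, so UTF-8 data wrongly labelled as ISO-8859-1 decodes without an error and simply shows the wrong characters.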
In general, the encoding has to be declared along with a document
(for XML in the XML declaration, for HTML in a <meta>
element, or for resources accessed via HTTP in the response header), as
there is no general algorithm that detects every existing encoding. For
instance, you cannot detect whether a document is meant to be ISO-8859-1
or ISO-8859-15 encoded; the document author has to declare the
encoding, because the same bytes are just interpreted as different characters.
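To make the ISO-8859-1 versus ISO-8859-15 point concrete, here is a small Python illustration using byte 0xA4, one of the few positions where the two encodings differ. The variable names are only for this example.

```python
raw = b"\xa4"

# The same byte, read under two different declared encodings.
as_latin1 = raw.decode("iso-8859-1")   # U+00A4 CURRENCY SIGN
as_latin9 = raw.decode("iso-8859-15")  # U+20AC EURO SIGN

both_decode_cleanly = as_latin1 != as_latin9
```

Neither decode raises an error, so nothing in the bytes themselves tells you which of the two the author meant.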
--
Martin Honnen http://JavaScript.FAQTs.com/
In <11*********************@g47g2000cwa.googlegroups.com>, on
09/13/2005 at 04:01 AM, da*********@postmark.net said:
> Here is the problem, some type of UTF-8 encoded file doesn't have BOM in the starting.
Why would any UTF-8 file have a BOM? That's for encodings with 16-bit
bytes, such as UTF-16. UTF-8 uses 8-bit bytes.
--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>
Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to sp******@library.lspace.org
Shmuel (Seymour J.) Metz wrote:
> In <11*********************@g47g2000cwa.googlegroups.com>, on 09/13/2005 at 04:01 AM, da*********@postmark.net said:
>> Here is the problem, some type of UTF-8 encoded file doesn't have BOM in the starting.
> Why would any UTF-8 file have a BOM? That's for encodings with 16-bit bytes, such as UTF-16. UTF-8 uses 8-bit bytes.
In mixed Unicode/non-unicode environments the BOM helps to discriminate
between Unicode/UTF-8 files and simpler ASCII/ISO-8859-x/... text files.
--
To reply by e-mail, please remove the extra dot
in the given address: m.collado -> mcollado
On Tue, 13 Sep 2005, Shmuel (Seymour J.) Metz wrote:
> Why would any UTF-8 file have a BOM?
FAQ: http://www.unicode.org/faq/utf_bom.html#28 and #29
> That's for encodings with 16-bit bytes, such as UTF-16.
Except that the encoding schemes utf-16BE and utf-16LE use 16-bit code
units (I'd avoid using the term "bytes"), but don't need a BOM,
because their endian-ness is specified by the name of the encoding
scheme.
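A short Python illustration of the three schemes, assuming Python's codec names "utf-16-be", "utf-16-le" and "utf-16" as stand-ins for utf-16BE, utf-16LE and utf-16 with BOM:

```python
import codecs

text = "A"  # U+0041, a single 16-bit code unit 0x0041

# Byte order fixed by the scheme name: no BOM is written.
big_endian = text.encode("utf-16-be")     # 00 41
little_endian = text.encode("utf-16-le")  # 41 00

# Plain "utf-16": Python writes a BOM, then uses the platform's byte order.
with_bom = text.encode("utf-16")
starts_with_bom = with_bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)

# On decoding, the BOM is consumed and tells the decoder the byte order.
round_trip = with_bom.decode("utf-16")
```

The same code unit appears as two different octet sequences, which is why the byte order has to come from somewhere: either the name of the scheme or the BOM.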
Alan J. Flavell (fl*****@ph.gla.ac.uk) wrote:
: On Tue, 13 Sep 2005, Shmuel (Seymour J.) Metz wrote:
: > Why would any UTF-8 file have a BOM?
: FAQ: http://www.unicode.org/faq/utf_bom.html#28 and #29
: > That's for encodings with 16-bit bytes, such as UTF-16.
: Except that the encoding schemes utf-16BE and utf-16LE use 16-bit code
: units (I'd avoid using the term "bytes"), but don't need a BOM,
: because their endian-ness is specified by the name of the encoding
: scheme.
utf-16BE and utf-16LE must be using 8 bit bytes, because if they were
using true 16-bit code units then there would be no endian-ness to
consider.
(I'm still waiting for hardware that increases character sizes. They've
done it for all other elementary units on the computer, integers, memory
pointers, etc, but for some reason not this one.)
--
This programmer available for rent.
In <43********@news.victoria.tc.ca>, on 09/13/2005
at 09:51 AM, yf***@vtn1.victoria.tc.ca (Malcolm Dew-Jones) said:
> : > Why would any UTF-8 file have a BOM?
> : FAQ: http://www.unicode.org/faq/utf_bom.html#28 and #29
Note that the file doesn't contain a BOM, but rather the UTF-8
encoding of a BOM. An actual BOM would not be valid UTF-8.
> (I'm still waiting for hardware that increases character sizes.
For most hardware, character size is irrelevant. Some devices deal
with large blocks of data. Some deal with graphical data rather than
text. Some deal with individual bits. Keyboards deal with scan codes
rather than conventional character representations. The only common PC
peripherals that I can think of that actually deal with characters as
characters are a display adapter or printer in text mode, and those
are essentially obsolete.
--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>
Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to sp******@library.lspace.org
On Tue, 13 Sep 2005, Malcolm Dew-Jones wrote:
> utf-16BE and utf-16LE must be using 8 bit bytes,
That's the distinction (as set out in recent Unicode terminologies)
between the Character Encoding Form (which in all these three cases is
designated utf-16, consisting of 16-bit code units), and its Character
Encoding Schemes (of which there are the three: utf-16 with BOM,
utf-16LE, and utf-16BE) for representing the 16-bit code units as an
octet stream.
See chapter 2, sections 2.5 and 2.6, e.g. http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf
as well as the previously-cited FAQs.
> because if they were using true 16-bit code units then there would be no endian-ness to consider.
It's unfortunate that when one reads "utf-16", without context, it is
unclear whether it's meant to refer to the C.E.F (and thus to comprise
all three C.E.Ses), or only to the one C.E.S. Perhaps it's a pity
they didn't devise different designations for the CEF and for the CES
(maybe "utf-16BOM" for the CES).
(This isn't a problem for utf-8, since there is only one CES for
that particular CEF, with the BOM being optional.)
> (I'm still waiting for hardware that increases character sizes.
Historically, there has been at least one machine with 36-bit words
that could be used as four 9-bit units; but that's past rather than
future!
> They've done it for all other elementary units on the computer, integers, memory pointers, etc, but for some reason not this one.)
I suspect you're more interested in raising it to 16 bits (or 32) than
to some non-multiple of 8, though.
best
On Tue, 13 Sep 2005, Shmuel (Seymour J.) Metz wrote:
> : FAQ: http://www.unicode.org/faq/utf_bom.html#28 and #29
> Note that the file doesn't contain a BOM, but rather the UTF-8 encoding of a BOM.
*No* data stream ever literally "contains" a BOM, any more than it
"contains" a copyright sign, or the letter "A" (the BOM, just like any
Unicode character, is an abstract concept): what a data stream
contains is the BOM encoded according to the appropriate "Character
Encoding Scheme". That's the whole point of the BOM, so that the
character encoding scheme can be recognised by inspecting the
encoding. So there were no surprises there.
> An actual BOM would not be valid UTF-8.
An "actual BOM" is an abstract concept!
The idea of dumping the hexadecimal number x'FEFF' into a utf-8 data
stream - if that was what you had in mind - would make no sense, any
more than dumping x'00A9' into it would make any sense to represent
the copyright sign. Isn't that obvious?
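For anyone who wants to see the difference, a small Python sketch: the abstract character U+FEFF comes out as different octets under each encoding scheme, while the raw octets FE FF dumped into a UTF-8 stream are rejected. The helper is_valid_utf8 is just a name made up for this example.

```python
bom = "\ufeff"        # the abstract character, not yet any bytes
copyright = "\u00a9"  # same idea for the copyright sign

bom_utf8 = bom.encode("utf-8")         # ef bb bf
bom_utf16be = bom.encode("utf-16-be")  # fe ff
bom_utf16le = bom.encode("utf-16-le")  # ff fe
copyright_utf8 = copyright.encode("utf-8")  # c2 a9, not 00 a9


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


raw_feff_is_utf8 = is_valid_utf8(b"\xfe\xff")  # raw hex number, not encoded
encoded_bom_is_utf8 = is_valid_utf8(bom_utf8)
```

The octets FE and FF never occur in well-formed UTF-8, so raw_feff_is_utf8 is False while encoded_bom_is_utf8 is True.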
Let's cut them some slack: when they say that it "contains a BOM",
they are taking it for granted that it means "appropriately encoded".
You can't put an abstract concept into a data stream *without* an
appropriate encoding, after all.