da*********@postmark.net wrote:
I'm a newbie to this XML world. My problem is identifying the encoding
of an XML file at runtime. What I'm currently doing is checking whether
a BOM is present in the XML; based on the BOM I identify the
encoding. Here is the problem: some UTF-8 encoded files
don't have a BOM at the start. So I identify such a file as
ISO-8859-1 encoded when it is actually encoded in UTF-8.
Well, for XML there are clear rules: if there is no XML declaration
specifying the encoding, then the document can only be UTF-8 or UTF-16
encoded, and you can decide between those two by the presence or absence
of a BOM (UTF-16 always needs one; for UTF-8 the BOM is optional).
So look at the BOM and the XML declaration (that is, <?xml
version="version.number" encoding="encoding-is-here"?>) to find the
encoding for XML:
<http://www.w3.org/TR/REC-xml/#charencoding>
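The rules above could be sketched in Python roughly as follows. This is only a minimal sketch, not a full implementation of the detection procedure in the spec: the function name sniff_xml_encoding is made up for illustration, UTF-32 and BOM-less UTF-16 are ignored, and a conflict between BOM and declaration is not reported.

```python
import codecs
import re

# BOMs this sketch knows about (assumption: only UTF-8 and UTF-16 matter here).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of an ASCII-compatible byte stream.
_DECL = re.compile(
    rb'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data: bytes) -> str:
    """Return the encoding the document claims to be in."""
    # 1. A BOM decides the matter.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 2. No BOM: look for an encoding in the XML declaration.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. No BOM and no declared encoding: XML says UTF-8.
    return "utf-8"


print(sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
print(sniff_xml_encoding(b'<?xml version="1.0"?><a/>'))
```

The point for the original problem is step 3: a file without a BOM and without a declared encoding is treated as UTF-8, not as ISO-8859-1.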
Of course, what the above really gives you is the encoding the
XML document is supposed to be in. An XML parser then has to check
that the whole document complies with that encoding. For example, if the
XML declaration says encoding="ISO-8859-1", the XML is
supposed to be in that encoding, and a parser then checks whether it
encounters any byte sequences that can't be decoded properly using
that encoding.
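To illustrate that checking step, here is a small Python sketch that does a strict decode of the whole byte stream. The helper name decodes_as is hypothetical, and a real XML parser does more than this (it also rejects characters that are not allowed in XML).

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if every byte sequence in data is valid in encoding."""
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# "café" written out as ISO-8859-1 bytes: the e-acute is the single byte 0xE9.
latin1_bytes = b"caf\xe9"

print(decodes_as(latin1_bytes, "iso-8859-1"))
print(decodes_as(latin1_bytes, "utf-8"))
```

Note that Python's iso-8859-1 codec maps every byte value to a character, so this particular check can only fail for encodings such as UTF-8 that have invalid byte sequences.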
In general, a declaration of the encoding needs to be associated
with a document (e.g. for XML in the XML declaration, for HTML in a <meta>
element, or for resources accessed via HTTP in the response header), as
there is no general algorithm to detect every encoding that exists. For
instance, you cannot detect whether a document is meant to be ISO-8859-1
encoded or ISO-8859-15 encoded: the same bytes are just interpreted as
different characters, so the document author has to declare the encoding.
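A short Python example of that last point (the sample bytes are made up for illustration): the byte 0xA4 is valid in both encodings but stands for a different character in each, so nothing in the bytes themselves tells you which one was meant.

```python
raw = b"price: 10 \xa4"

as_latin1 = raw.decode("iso-8859-1")   # 0xA4 is U+00A4 CURRENCY SIGN
as_latin9 = raw.decode("iso-8859-15")  # 0xA4 is U+20AC EURO SIGN

# Both decodes succeed; only the final character differs.
print(as_latin1)
print(as_latin9)
```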
--
Martin Honnen
http://JavaScript.FAQTs.com/