
How do xml parsers handle encoding?


If an XML file specifies an encoding, e.g. UTF-16, do XML browsers and
XML editors read and verify each character in the file to make sure it
is UTF-16, and throw an error if it is not? Or do they automatically
filter/convert to UTF-16, or do they do something else?

Do they default to UTF-8 if the XML file does not specify an encoding?

Bill
Jun 27 '08 #1
Martin Honnen wrote:
And XML parsers are required to check that documents are properly
encoded. However, I think browsers like Firefox or Opera might not report
any such violation. For instance, I saved an XML document as UTF-8 but
with an XML declaration saying encoding="UTF-16" and then loaded it in
Firefox 2.0 and Opera 9. Neither reported an error; both instead
treated the document as UTF-8. IE 6 reported an error.
For Mozilla, the FAQ
http://developer.mozilla.org/en/docs...l_documents.3F
says:
"Most well-formedness constraints are enforced. (Currently Mozilla
does not catch character encoding errors, because the document is
re-encoded using a lenient encoding converter before the document
reaches the XML parser. This is a bug.)"
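
For comparison, here is a minimal sketch of how a stricter parser behaves on the same kind of input. It assumes Python's standard `xml.etree.ElementTree` (which is backed by expat); the helper name `try_parse` and the sample documents are made up for illustration, and other parsers may word their errors differently or behave differently.

```python
import xml.etree.ElementTree as ET

def try_parse(data: bytes) -> str:
    """Parse raw bytes and report whether the parser accepted them."""
    try:
        root = ET.fromstring(data)
        return "ok: " + (root.text or "")
    except ET.ParseError as exc:
        return "error: " + str(exc)

# UTF-8 bytes whose XML declaration claims UTF-16.
mislabeled = b'<?xml version="1.0" encoding="UTF-16"?><a>x</a>'
# Declared UTF-8, but 0xFF is a byte that never occurs in UTF-8.
bad_bytes = b'<?xml version="1.0" encoding="UTF-8"?><a>\xff</a>'
# No declaration and no byte order mark: treated as UTF-8.
undeclared = "<a>\u00e9</a>".encode("utf-8")

for label, doc in [("mislabeled", mislabeled),
                   ("bad bytes", bad_bytes),
                   ("undeclared", undeclared)]:
    print(label, "->", try_parse(doc))
```

The first two documents should be rejected with a parse error, while the third parses as UTF-8 even though it carries no encoding declaration.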

--

Martin Honnen
http://JavaScript.FAQTs.com/
Jun 27 '08 #2
The rules for how they're *supposed* to handle it are spelled out in the
XML Recommendation. Not all parsers are in strict compliance with all
parts of the recommendation, alas. Bugs happen.

If you're asking whether you can get away with cheating: the brief
answer is that it's extremely bad practice to try. If you're asking
whether you can be certain a particular parser will or won't let
something through, you can ask its development/user community... but be
aware that the next release may fix this, and it's a very bad idea to
write code that depends on bugs in specific versions.
Jun 27 '08 #3
On Apr 30, 8:20 am, Martin Honnen <mahotr...@yahoo.de> wrote:
>And XML parsers are required to check that documents are properly
>encoded.
So how do they do that? Do they check every character, or do they just
convert? If the encoding attribute is UTF-8 and the file has a
character that is not UTF-8, does the browser raise an error, convert it,
or what? Like if a Korean character is in a file that says it is UTF-8.

Bill
Jun 27 '08 #4
In article <e9**********************************@t12g2000prg.googlegroups.com>,
<bi*********@yahoo.com> wrote:
>>And XML parsers are required to check that documents are properly
>>encoded.
>So how do they do that? do they check every character?
Yes.
>Like if a Korean character is in a file that says it is utf8.
UTF-8 covers all of Unicode, so it includes Korean characters.

A parser has to check two things: that the data is legal for the
encoding (for example, some sequences of bytes are not legal in
UTF-8), and that the character it encodes is allowed in XML.
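
A minimal sketch of those two checks, assuming Python and the `Char` production of XML 1.0 for the set of allowed characters. The function names `is_xml_char` and `check_xml_bytes` are hypothetical, and a real parser does this work incrementally while tokenizing rather than in two passes.

```python
def is_xml_char(cp: int) -> bool:
    """True if code point cp matches the Char production of XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

def check_xml_bytes(data: bytes, encoding: str = "utf-8") -> str:
    """Run both checks and describe the first problem found, or return 'ok'."""
    # Check 1: is the byte sequence legal in the declared encoding?
    try:
        text = data.decode(encoding, errors="strict")
    except UnicodeDecodeError as exc:
        return f"illegal byte sequence at offset {exc.start}"
    # Check 2: is every decoded character allowed in XML?
    for index, ch in enumerate(text):
        if not is_xml_char(ord(ch)):
            return f"character U+{ord(ch):04X} at index {index} is not allowed in XML"
    return "ok"

print(check_xml_bytes("한국어".encode("utf-8")))  # Korean text is valid UTF-8
print(check_xml_bytes(b"abc\xc3("))               # 0xC3 must be followed by a continuation byte
print(check_xml_bytes(b"a\x00b"))                 # U+0000 decodes fine but is not an XML character
```

The Korean sample passes both checks, the second input fails the encoding check, and the third decodes cleanly but fails the XML character check.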

-- Richard
--
:wq
Jun 27 '08 #5
On Apr 30, 9:49 am, rich...@cogsci.ed.ac.uk (Richard Tobin) wrote:
>In article <e96ae004-e602-4b72-a7b5-608f11ef2...@t12g2000prg.googlegroups.com>,
><billsahi...@yahoo.com> wrote:
>>>And XML parsers are required to check that documents are properly
>>>encoded.
>>So how do they do that? do they check every character?
>
>Yes.
>
>>Like if a Korean character is in a file that says it is utf8.
>
>utf-8 covers all of Unicode, so it includes Korean characters.
>
>A parser has to check two things: that the data is legal for the
>encoding (for example, some sequences of bytes are not legal in
>UTF-8), and that the character it encodes is allowed in XML.
>
>-- Richard
>--
>:wq

OK. I don't know if you are a .NET programmer or not (Martin is, so maybe
he can respond to this too), but if I use StreamReader to read an XML
file with the encoding specified as UTF-8, and I set the
streamreader.encoding property to UTF-8, will StreamReader throw an
exception if a character is not UTF-8,
or do I have to parse every character and check its value to see if it
is in the UTF-8 range?

Bill
Jun 27 '08 #6
bi*********@yahoo.com wrote:
>OK. I don't know if you are a .NET programmer or not (Martin is, so maybe
>he can respond to this too), but if I use StreamReader to read an XML
>file with the encoding specified as UTF-8, and I set the
>streamreader.encoding property to UTF-8, will StreamReader throw an
>exception if a character is not UTF-8,
>or do I have to parse every character and check its value to see if it
>is in the UTF-8 range?
As far as I know StreamReader does not throw an exception.
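
The difference between a lenient and a strict decoder can be sketched as follows. This is a Python analogy, not .NET code: `errors="replace"` stands in for a reader that silently substitutes U+FFFD for bad bytes, and `errors="strict"` for one that raises. In .NET, I believe the comparable switch is the `throwOnInvalidBytes` argument of the `UTF8Encoding` constructor, but check the documentation before relying on that.

```python
# "café" encoded as Latin-1; the trailing 0xE9 is not valid UTF-8 here.
data = b"caf\xe9"

# Lenient decoding: the bad byte is silently replaced with U+FFFD.
lenient = data.decode("utf-8", errors="replace")
print(repr(lenient))

# Strict decoding: the same input raises instead of being repaired.
try:
    data.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print("strict decoder rejected byte at offset", exc.start)
```

With a lenient decoder the only trace of the problem is a U+FFFD replacement character in the resulting text, so scanning for it afterwards is one way to notice that something was wrong.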
--

Martin Honnen
http://JavaScript.FAQTs.com/
Jun 27 '08 #7
bi*********@yahoo.com wrote:
>So how do they do that? Do they check every character, or do they just
>convert?
Most hand it off to an appropriate encoding-aware stream reader library
and let that code do the work. Why build a wheel when you can buy one?
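
A rough sketch of that hand-off, assuming Python's standard `codecs` incremental decoder as the "encoding-aware stream reader". The function name `decode_chunks` is made up for illustration. The decoder holds back a multi-byte sequence that is split across chunks and raises as soon as it sees bytes that are illegal for the encoding.

```python
import codecs
from typing import Iterable, Iterator

def decode_chunks(chunks: Iterable[bytes], encoding: str = "utf-8") -> Iterator[str]:
    """Decode a stream of byte chunks, raising UnicodeDecodeError on bad input."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # final=True makes a truncated multi-byte sequence at end of input an error.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail

# The Korean syllable U+D55C is ED 95 9C in UTF-8; here it is split across two chunks.
chunks = [b"<a>\xed\x95", b"\x9c</a>"]
print("".join(decode_chunks(chunks)))
```

The parser then only has to deal with already-decoded characters and can concentrate on the XML-level checks.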
Jun 27 '08 #8
