How to ask sax for the file encoding

Following the usual cookbook examples, my app parses an open file as
follows::

    import xml.sax.handler

    parser = xml.sax.make_parser()
    parser.setFeature(xml.sax.handler.feature_external_ges, 1)
    # Hopefully the content handler can figure out the encoding from the
    # <?xml> element.
    handler = saxContentHandler(c, inputFileName, silent)
    parser.setContentHandler(handler)
    parser.parse(theFile)

Can anyone tell me how the content handler can determine the encoding of the
file? Can sax provide this info?

Thanks!

Edward
--------------------------------------------------------------------
Edward K. Ream email: ed*******@charter.net
Leo: http://webpages.charter.net/edreamleo/front.html
--------------------------------------------------------------------

Oct 4 '06 #1
Edward K. Ream wrote:
> Can anyone tell me how the content handler can determine the encoding of the
> file? Can sax provide this info?

there is no encoding on the "inside" of an XML document; it's all Unicode.

</F>

Oct 4 '06 #2
>> Can anyone tell me how the content handler can determine the encoding of
>> the file? Can sax provide this info?
> there is no encoding on the "inside" of an XML document; it's all Unicode.

True, but sax is reading the file, so sax is producing the unicode, so it
should (must) be able to determine the encoding. Furthermore, xml files
start with lines like:

<?xml version="1.0" encoding="utf-8"?>

so it would seem reasonable for sax to be able to return 'utf-8' somehow.
Am I missing something?

Edward
Oct 4 '06 #3
Edward K. Ream wrote:
>>> Can anyone tell me how the content handler can determine the encoding of
>>> the file? Can sax provide this info?
>> there is no encoding on the "inside" of an XML document; it's all
>> Unicode.
>
> True, but sax is reading the file, so sax is producing the unicode, so it
> should (must) be able to determine the encoding.

It is, by reading the xml header.

> Furthermore, xml files
> start with lines like:
>
> <?xml version="1.0" encoding="utf-8"?>
>
> so it would seem reasonable for sax to be able to return 'utf-8' somehow.
> Am I missing something?

That sax outputs unicode, which no longer has any encoding associated with
it. And thus it is pretty much irrelevant information. It _could_ be
retained, but for what purpose?

Diez
Oct 4 '06 #4
Edward K. Ream wrote:
> <?xml version="1.0" encoding="utf-8"?>
>
> so it would seem reasonable for sax to be able to return 'utf-8' somehow.

why? that's an encoding detail, and should be completely irrelevant for
your application.

> Am I missing something?

you're confusing artifacts of an external serialization format with the actual
data model. don't do that, if you can avoid it.

what's your use case ?

</F>

Oct 4 '06 #5
> [The value of the encoding field] _could_ be retained, but for what
> purpose?

I'm asking this question because my app needs it :-) Imo, there is *no*
information in any xml file that can be considered irrelevant. My app will
want to know the original encoding when writing the file.

Edward
Oct 4 '06 #6
Edward K. Ream wrote:
>> [The value of the encoding field] _could_ be retained, but for what
>> purpose?
>
> I'm asking this question because my app needs it :-)
> Imo, there is *no*
> information in any xml file that can be considered irrelevant.

There sure is! The encoding _is_ irrelevant from the very moment you get
unicode strings. The order of attributes is irrelevant. There is plenty of
irrelevant whitespace. And so on...

> My app will
> want to know the original encoding when writing the file.

If your app needs it, what does it need it for? If you write out xml again,
use whatever encoding suits you best. If you don't, use the encoding that
the subsequent application or processing step needs.

Diez
Oct 4 '06 #7
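
A minimal sketch of the point about choosing the encoding at write time, assuming Python 3 and the standard library's `xml.sax.saxutils.XMLGenerator`. The generator takes an `encoding` argument and emits a matching XML declaration. The helper name `write_title` and the `title` element are invented for illustration and do not come from Leo or from this thread.

```python
import io
from xml.sax.saxutils import XMLGenerator


def write_title(title, encoding):
    """Serialize a one-element document in the requested encoding (hypothetical helper)."""
    out = io.BytesIO()
    gen = XMLGenerator(out, encoding=encoding)
    gen.startDocument()            # writes the <?xml ...?> declaration
    gen.startElement("title", {})
    gen.characters(title)          # text is encoded using `encoding`
    gen.endElement("title")
    gen.endDocument()
    return out.getvalue()


latin1_bytes = write_title("caf\u00e9", "iso-8859-1")
utf8_bytes = write_title("caf\u00e9", "utf-8")
```

The same Unicode text comes out as different bytes depending only on the `encoding` argument, which is why the writer, not the parser, is the place where the choice has to be made.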
> The encoding _is_ irrelevant, in the very moment you get unicode strings.

We shall have to disagree about this. My use case is perfectly reasonable,
imo.

> If you write out xml again, use whatever encoding suits you best.

What suits me best is what the *user* specified, and that got put in the
first xml line.
I'm going to have to parse this line myself.

Edward
Oct 4 '06 #8
"Edward K. Ream" <ed*******@charter.net> writes:
> Can anyone tell me how the content handler can determine the encoding of the
> file? Can sax provide this info?

Try this:

<code>
from xml.parsers import expat

s = """<?xml version='1.0' encoding='iso-8859-1'?>
<book>
<title>Title</title>
<chapter>Chapter 1</chapter>
</book>
"""

class MyParser(object):
    def XmlDecl(self, version, encoding, standalone):
        # Called by expat once it has read the <?xml ...?> declaration.
        print("XmlDecl %s %s %s" % (version, encoding, standalone))

    def Parse(self, data):
        Parser = expat.ParserCreate()
        Parser.XmlDeclHandler = self.XmlDecl
        Parser.Parse(data, 1)

parser = MyParser()
parser.Parse(s)
</code>

--
HTH,
Rob
Oct 4 '06 #9
Edward K. Ream wrote:
> What suits me best is what the *user* specified, and that got put in the
> first xml line.
> I'm going to have to parse this line myself.

Please consider adding some elements to the document itself that
describe the desired output format, such as:

....
<output>
<encoding>utf-8</encoding>
</output>
....

This allows the client to specify the encoding it wants to receive
the document in, even if it's different than the encoding it used
to make the first document. More flexibility. Less fooling around.

--Irmen
Oct 4 '06 #10
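
One way Irmen's suggestion could look on the reading side, as a sketch only: a SAX content handler that picks up the text of an `<output><encoding>` element. The class name `OutputEncodingHandler`, the `leo_file` root element, and the sample document are assumptions made for this example; Leo's real file format is not shown in the thread.

```python
import xml.sax


class OutputEncodingHandler(xml.sax.ContentHandler):
    """Collects the text of <output><encoding>...</encoding></output>."""

    def __init__(self):
        super().__init__()
        self.path = []                # stack of currently open element names
        self.output_encoding = None   # stays None if the element is absent
        self._chunks = []

    def _in_encoding_element(self):
        return self.path[-2:] == ["output", "encoding"]

    def startElement(self, name, attrs):
        self.path.append(name)
        if self._in_encoding_element():
            self._chunks = []

    def characters(self, content):
        # SAX may deliver text in several pieces, so accumulate them.
        if self._in_encoding_element():
            self._chunks.append(content)

    def endElement(self, name):
        if self._in_encoding_element():
            self.output_encoding = "".join(self._chunks).strip()
        self.path.pop()


# Hypothetical document: declared (input) encoding differs from the requested one.
doc = b"""<?xml version="1.0" encoding="iso-8859-1"?>
<leo_file>
  <output>
    <encoding>utf-8</encoding>
  </output>
</leo_file>
"""

handler = OutputEncodingHandler()
xml.sax.parseString(doc, handler)
```

After parsing, `handler.output_encoding` holds the requested output encoding independently of how the input bytes happened to be encoded.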
Edward K. Ream wrote:
> I'm asking this question because my app needs it :-) Imo, there is *no*
> information in any xml file that can be considered irrelevant.

the encoding isn't *in* the XML file, it's an artifact of the
serialization model used for a specific XML infoset. the XML
data is pure Unicode.

</F>

Oct 4 '06 #11
Edward K. Ream wrote:
> What suits me best is what the *user* specified, and that got put in the
> first xml line.

are you expecting your users to write XML by hand? ouch.

</F>

Oct 4 '06 #12
> are you expecting your users to write XML by hand?

Of course not. Leo has the following option:

@string new_leo_file_encoding = utf-8

Edward
Oct 4 '06 #13
> Please consider adding some elements to the document itself that
> describe the desired output format,

Well, that's what the encoding field in the xml line was supposed to do.
Not a bad idea though, except it changes the file format, and I would really
rather not do that.

Edward
Oct 4 '06 #14
> the encoding isn't *in* the XML file, it's an artifact of the
> serialization model used for a specific XML infoset. the XML
> data is pure Unicode.

Sorry, but no. The *file* is what I am talking about, and the way it is
encoded does, in fact, really make a difference to some users. They have a
right, I think, to expect that the original encoding gets preserved when the
file is rewritten.

Edward
Oct 4 '06 #15
> Try this:
> [snip]
> Parser.XmlDeclHandler = self.XmlDecl
> [snip]

Excellent! Thanks so much.

Edward
Oct 4 '06 #16
Edward K. Ream wrote:
> Can anyone tell me how the content handler can determine the encoding of the
> file? Can sax provide this info?

That's not supported in SAX. If you use Expat directly (module pyexpat),
you can set the XmlDeclHandler, which is called when the XML declaration
is received (with the parameters version, encoding, and standalone).
However, as the XML declaration is optional, this callback might
not get invoked.

Regards,
Martin
Oct 4 '06 #17
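
Martin's caveat about the optional declaration can be made concrete with a small sketch, assuming Python 3 and `xml.parsers.expat`. The function name `sniff_declared_encoding` is hypothetical; the only expat features used are `ParserCreate`, `XmlDeclHandler`, and `Parse`, the same ones shown earlier in the thread.

```python
from xml.parsers import expat


def sniff_declared_encoding(data):
    """Return the encoding named in the XML declaration of `data`, or None.

    None is returned both when there is no declaration at all and when the
    declaration does not name an encoding.
    """
    found = []

    def on_xml_decl(version, encoding, standalone):
        # Only invoked if the document actually starts with <?xml ...?>.
        found.append(encoding)

    parser = expat.ParserCreate()
    parser.XmlDeclHandler = on_xml_decl
    parser.Parse(data, True)
    return found[0] if found else None


with_decl = sniff_declared_encoding(
    b"<?xml version='1.0' encoding='iso-8859-1'?><book/>"
)
without_decl = sniff_declared_encoding(b"<book/>")
```

A caller that wants to preserve the original encoding would still need its own fallback for the `None` case, since XML permits the declaration to be omitted.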
Edward K. Ream wrote:
>> Please consider adding some elements to the document itself that
>> describe the desired output format,
>
> Well, that's what the encoding field in the xml line was supposed to do.

As others have tried to explain, the encoding in the xml header is
not part of the document data itself, it says something about the data.
It would be a bad design decision imo to rely on this meta information
if you really meant that information to be part of the data document.

> Not a bad idea though, except it changes the file format, and I would really
> rather not do that.

XML allows you to easily skip any elements that you think you don't need.

--Irmen
Oct 4 '06 #18
Irmen de Jong wrote:
> As others have tried to explain, the encoding in the xml header is
> not part of the document data itself, it says something about the data.
> It would be a bad design decision imo to rely on this meta information
> if you really meant that information to be part of the data document.

A common problem is to save the data in the same encoding that it
originally had; this is what an editor typically does (you may know
Edward Ream for writing editors). XML parsers are notoriously bad
at supporting editors. There are so many lexical details that may
need to be preserved (such as the order of the attributes, and the
spaces inside the opening tag) that it is impractical to report all
of them to the application.

IMO, the only way to edit XML on a level that preserves even
the tiniest lexical details is to edit it as plain text
(i.e. without using an XML parser).

Regards,
Martin
Oct 4 '06 #19
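
In the same plain-text spirit, the declared encoding can be read without any XML parser at all. The following is a rough sketch, not a full implementation of the XML specification's encoding detection: it assumes the raw bytes are in an ASCII-compatible encoding (so not UTF-16 or UTF-32), and the names `declared_encoding` and `default` are made up for this example.

```python
import re

# Matches encoding="..." or encoding='...' inside a leading <?xml ...?> declaration.
_DECL_RE = re.compile(
    br'<\?xml\s[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(raw, default="utf-8"):
    """Return the encoding named in the XML declaration of `raw` (bytes).

    Falls back to `default` when there is no declaration or it names no encoding.
    """
    # Skip a UTF-8 byte order mark if one is present.
    if raw.startswith(b"\xef\xbb\xbf"):
        raw = raw[3:]
    match = _DECL_RE.match(raw)   # match() only looks at the very start
    if match is None:
        return default
    return match.group(1).decode("ascii")


sample = b'<?xml version="1.0" encoding="iso-8859-1"?>\n<book/>'
sample_encoding = declared_encoding(sample)
```

Because only the first few dozen bytes matter, a caller could pass just the start of the file rather than reading the whole document.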
Martin v. Löwis wrote:
> A common problem is to save the data in the same encoding that it
> originally had; this is what an editor typically does (you may know
> Edward Ream for writing editors). XML parsers are notoriously bad
> at supporting editors. There are so many lexical details that may
> need to be preserved (such as the order of the attributes, and the
> spaces inside the opening tag) that it is impractical to report all
> of them to the application.

an editor designed to work on the XML serialization level shouldn't use
a traditional XML parser at all, of course. definitely not SAX or DOM,
or any other infoset-or-higher-level API.

on the other hand, an editor that just happens to use XML as a
serialization format might as well decide on a model representation
and an encoding and stick to it. being tolerant in what it accepts
is a good idea, of course, but being consistent in what it generates
is an even better idea.

</F>

Oct 5 '06 #20
