473,788 Members | 2,652 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

How to ask sax for the file encoding

Following the usual cookbook examples, my app parses an open file as
follows::

parser = xml.sax.make_pa rser()

parser.setFeatu re(xml.sax.hand ler.feature_ext ernal_ges,1)

# Hopefully the content handler can figure out the encoding from the <?xml>
element.

handler = saxContentHandl er(c,inputFileN ame,silent)

parser.setConte ntHandler(handl er)

parser.parse(th eFile)

Can anyone tell me how the content handler can determine the encoding of the
file? Can sax provide this info?

Thanks!

Edward
--------------------------------------------------------------------
Edward K. Ream email: ed*******@chart er.net
Leo: http://webpages.charter.net/edreamleo/front.html
--------------------------------------------------------------------

Oct 4 '06
19 2595
Edward K. Ream wrote:
I'm asking this question because my app needs it :-) Imo, there is *no*
information in any xml file that can be considered irrelvant.
the encoding isn't *in* the XML file, it's an artifact of the
serialization model used for a specific XML infoset. the XML
data is pure Unicode.

</F>

Oct 4 '06 #11
Edward K. Ream wrote:
What suits me best is what the *user* specified, and that got put in the
first xml line.
are you expecting your users to write XML by hand? ouch.

</F>

Oct 4 '06 #12
are you expecting your users to write XML by hand?

Of course not. Leo has the following option:

@string new_leo_file_en coding = utf-8

Edward
--------------------------------------------------------------------
Edward K. Ream email: ed*******@chart er.net
Leo: http://webpages.charter.net/edreamleo/front.html
--------------------------------------------------------------------
Oct 4 '06 #13
Please consider adding some elements to the document itself that
describe the desired output format,

Well, that's what the encoding field in the xml line was supposed to do.
Not a bad idea though, except it changes the file format, and I would really
rather not do that.

Edward
--------------------------------------------------------------------
Edward K. Ream email: ed*******@chart er.net
Leo: http://webpages.charter.net/edreamleo/front.html
--------------------------------------------------------------------
Oct 4 '06 #14
the encoding isn't *in* the XML file, it's an artifact of the
serialization model used for a specific XML infoset. the XML
data is pure Unicode.
Sorry, but no. The *file* is what I am talking about, and the way it is
encoded does, in fact, really make a difference to some users. They have a
right, I think, to expect that the original encoding gets preserved when the
file is rewritten.

Edward
--------------------------------------------------------------------
Edward K. Ream email: ed*******@chart er.net
Leo: http://webpages.charter.net/edreamleo/front.html
--------------------------------------------------------------------
Oct 4 '06 #15
Try this:
[snip]
Parser.XmlDeclH andler = self.XmlDecl
[snip]

Excellent! Thanks so much.

Edward
--------------------------------------------------------------------
Edward K. Ream email: ed*******@chart er.net
Leo: http://webpages.charter.net/edreamleo/front.html
--------------------------------------------------------------------
Oct 4 '06 #16
Edward K. Ream schrieb:
Can anyone tell me how the content handler can determine the encoding of the
file? Can sax provide this info?
That's not supported in SAX. If you use Expat directly (module pyexpat),
you can set the XmlDeclHandler, which is called when the XML declaration
is received (with the parameters version, encoding, and standalone).
However, as the XML declaration is optional, this callback might
not get invoked.

Regards,
Martin
Oct 4 '06 #17
Edward K. Ream wrote:
>Please consider adding some elements to the document itself that
describe the desired output format,

Well, that's what the encoding field in the xml line was supposed to do.
As others have tried to explain, the encoding in the xml header is
not part of the document data itself, it says something about the data.
It would be a bad design decision imo to rely on this meta information
if you really meant that information to be part of the data document.
Not a bad idea though, except it changes the file format, and I would really
rather not do that.
XML allows you to easily skip any elements that you think you don't need.

--Irmen
Oct 4 '06 #18
Irmen de Jong schrieb:
As others have tried to explain, the encoding in the xml header is
not part of the document data itself, it says something about the data.
It would be a bad design decision imo to rely on this meta information
if you really meant that information to be part of the data document.
A common problem is to save the data in the same encoding that they
original had; this is what an editor typically does (you may know
Edward Ream for writing editors). XML parsers are notoriously bad
in supporting editors. There are too many lexical details that may
need to be preserved (such as the order of the attributes, and the
spaces inside the opening tag) to make it impractical to report all
that to the application.

IMO, the only way to edit XML on a level that does preserving
of the tiniest lexical details is to edit it as plain text
(i.e. without using an XML parser).

Regards,
Martin
Oct 4 '06 #19
Martin v. Löwis wrote:
A common problem is to save the data in the same encoding that they
original had; this is what an editor typically does (you may know
Edward Ream for writing editors). XML parsers are notoriously bad
in supporting editors. There are too many lexical details that may
need to be preserved (such as the order of the attributes, and the
spaces inside the opening tag) to make it impractical to report all
that to the application.
an editor designed to work on the XML serialization level shouldn't use
a traditional XML parser at all, of course. definitely not SAX or DOM,
or any other infoset-or-higher-level API.

on the other hand, an editor that just happens to use XML as a
serialization format might as well decide on a model representation
and an encoding and stick to it. being tolerant in what it accepts
is a good idea, of course, but being consistent in what it generates
is an even better idea.

</F>

Oct 5 '06 #20

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

3
5144
by: Mark Miller | last post by:
I have a char array and when I write it to a file using BinaryWriter the position of the pointer is the size of the array + 1. For example: writing char leaves the pointer at position 26 after starting at position 0. I thought that char was 2 bytes, but this makes it seem as though it is just 1 when I write to a file. Why is this? I imagine the extra bit is just a null bit (correct me if I'm wrong). I don't know if this helps but when I...
3
7775
by: hunterb | last post by:
I have a file which has no BOM and contains mostly single byte chars. There are numerous double byte chars (Japanese) which appear throughout. I need to take the resulting Unicode and store it in a DB and display it onscreen. No matter which way I open the file, convert it to Unicode/leave it as is or what ever, I see all single bytes ok, but double bytes become 2 seperate single bytes. Surely there is an easy way to convert these mixed...
5
15046
by: Lenard Gunda | last post by:
hi! I have the following problem. I need to read data from a TXT file our company receives. I would use StreamReader, and process it line by line using ReadLine, however, the following problem occurs. The file contains characters with ASCII codes above 128. But the file is still text (nothing like UTF7/8 or the like). It also might contain + signs. As a result:
12
2992
by: Brian Henry | last post by:
first question... I have a flat file which unfortinuatly has columns seperated by nulls instead of spaces (a higher up company created it this way for us) is there anyway to do a readline with this and not have it affected by the null? because it is right now causes truncated data at wierd places... but as soon as i manually with a hex editor change char(00) to char(20) in the files it reads prerfectly... which leads me to my 2nd...
3
5128
by: Chip | last post by:
There is surprisingly little information on the various encoding options for reading a text file. I have what seems to be a very basic issue: I'm reading a text file that includes Spanish characters such as "ñ". When I read the file into a string, that character is missing. Encoding seems to be the culprit. File writers SHOULD begin a file with the BOM (Byte Order Mark) to let us know what encoding to read the file with, but most software...
1
6509
by: laredotornado | last post by:
Hi, I'm using PHP 4.4.4 on Apache 2 on Fedora Core 5. PHP was installed using Apache's apxs and the php library was installed to /usr/local/php. However, when I set my "error_reporting" setting to be "E_ALL", notices are still not getting reported. The perms on my file are 664, with owner root and group root. The php.ini file is located at /usr/local/lib/php/php.ini. Any ideas why the setting does not seem to be having an effect? ...
2
7539
by: starffly | last post by:
I want to read a xml file in Unicode, UTF-8 or a native encoding into a wchar_t type string, so i write a routine as follows, however, sometimes a Unicode file including Chinese character cannot be read completely. and I cannot tell where its root located, so NEED your help, GIVE me a hand please. THX. static Status LoadXMLFile2String(const char *filename, wchar_t *text){ FILE *f; if(!(f = fopen(filename, "r"))){ __printDebugA("Input...
1
32950
by: ujjwaltrivedi | last post by:
Hey guys, Can anyone tell me how to create a text file with Unicode Encoding. In am using FileStream Finalfile = new FileStream("finalfile.txt", FileMode.Append, FileAccess.Write); ###Question: Now this creates finalfile.txt with ANSI Encoding ...which is a default. Either tell me how to change the default or how to create a
3
8299
by: JDeats | last post by:
I have some .NET 1.1 code that utilizes this technique for encrypting and decrypting a file. http://support.microsoft.com/kb/307010 In .NET 2.0 this approach is not fully supported (a .NET 2.0 build with these methods, will appear to encrypt and decrypt, but the resulting decrypted file will be corrupted. I tried encrypting a .bmp file and then decrypting, the resulting decrypted file under .NET 2.0 is garbage, the .NET 1.1 build works...
2
2825
by: JimmyKoolPantz | last post by:
We purchased som software for encoding a barcode. We want to automate the process of converting a number to a readable barcode. However, I am having a few issues. The file that the barcode needs to be appended to is dbf file type, I wrote a dbf writer so adding data on the fly is not a problem, the problem is trying to get the correct encoding. The barcode utility that we purchased
0
9656
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...
0
10366
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
0
10173
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
1
7517
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
6750
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
5399
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
1
4070
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
2
3674
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.
3
2894
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.