How to ask sax for the file encoding

Following the usual cookbook examples, my app parses an open file as
follows::

    parser = xml.sax.make_parser()
    parser.setFeature(xml.sax.handler.feature_external_ges, 1)

    # Hopefully the content handler can figure out the encoding from the <?xml>
    # element.
    handler = saxContentHandler(c, inputFileName, silent)
    parser.setContentHandler(handler)
    parser.parse(theFile)

Can anyone tell me how the content handler can determine the encoding of the
file? Can sax provide this info?

Thanks!

Edward
--------------------------------------------------------------------
Edward K. Ream email: ed*******@charter.net
Leo: http://webpages.charter.net/edreamleo/front.html
--------------------------------------------------------------------

Oct 4 '06 #1
Edward K. Ream wrote:
> Can anyone tell me how the content handler can determine the encoding of the file? Can sax
> provide this info?

there is no encoding on the "inside" of an XML document; it's all Unicode.

</F>

Oct 4 '06 #2
>> Can anyone tell me how the content handler can determine the encoding of
>> the file? Can sax provide this info?
> there is no encoding on the "inside" of an XML document; it's all Unicode.

True, but sax is reading the file, so sax is producing the unicode, so it
should (must) be able to determine the encoding. Furthermore, xml files
start with lines like:

<?xml version="1.0" encoding="utf-8"?>

so it would seem reasonable for sax to be able to return 'utf-8' somehow.
Am I missing something?

Edward
Oct 4 '06 #3
Edward K. Ream wrote:
>>> Can anyone tell me how the content handler can determine the encoding of
>>> the file? Can sax provide this info?
>> there is no encoding on the "inside" of an XML document; it's all
>> Unicode.
>
> True, but sax is reading the file, so sax is producing the unicode, so it
> should (must) be able to determine the encoding.

It is, by reading the xml header.

> Furthermore, xml files start with lines like:
>
> <?xml version="1.0" encoding="utf-8"?>
>
> so it would seem reasonable for sax to be able to return 'utf-8' somehow.
> Am I missing something?

That sax outputs unicode, which no longer has any encoding associated with
it. So the declared encoding is pretty much irrelevant information. It
_could_ be retained, but for what purpose?

Diez
Oct 4 '06 #4
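
A minimal sketch of the point made in the reply above, assuming Python 3 and the standard `xml.sax` module: the same text stored in two different encodings should reach the content handler as identical Unicode strings. `TextCollector` and `parse_text` are hypothetical names used only for this illustration.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that gathers all character data it receives."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # content is already a decoded str; no encoding travels with it.
        self.chunks.append(content)


def parse_text(data):
    """Parse XML given as bytes and return the concatenated character data."""
    handler = TextCollector()
    xml.sax.parseString(data, handler)
    return "".join(handler.chunks)


text = "caf\u00e9"
latin1_doc = (
    '<?xml version="1.0" encoding="iso-8859-1"?><t>%s</t>' % text
).encode("iso-8859-1")
utf8_doc = (
    '<?xml version="1.0" encoding="utf-8"?><t>%s</t>' % text
).encode("utf-8")

from_latin1 = parse_text(latin1_doc)
from_utf8 = parse_text(utf8_doc)
print(from_latin1 == from_utf8 == text)
```

The two byte strings differ on disk, but the handler cannot tell them apart, which is why the content handler interface has no obvious place to report the original encoding.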
Edward K. Ream wrote:
> <?xml version="1.0" encoding="utf-8"?>
>
> so it would seem reasonable for sax to be able to return 'utf-8' somehow.

why? that's an encoding detail, and should be completely irrelevant for
your application.

> Am I missing something?

you're confusing artifacts of an external serialization format with the actual
data model. don't do that, if you can avoid it.

what's your use case?

</F>

Oct 4 '06 #5
> [The value of the encoding field] _could_ be retained, but for what
> purpose?

I'm asking this question because my app needs it :-) Imo, there is *no*
information in any xml file that can be considered irrelevant. My app will
want to know the original encoding when writing the file.

Edward
Oct 4 '06 #6
Edward K. Ream wrote:
>> [The value of the encoding field] _could_ be retained, but for what
>> purpose?
>
> I'm asking this question because my app needs it :-)
> Imo, there is *no*
> information in any xml file that can be considered irrelevant.

It sure is! The encoding _is_ irrelevant, in the very moment you get unicode
strings. The order of attributes is irrelevant. There is plenty of
irrelevant whitespace. And so on...

> My app will
> want to know the original encoding when writing the file.

If your app needs it, what does it need it for? If you write out xml again,
use whatever encoding suits you best. If you don't, use the encoding that
the subsequent application or processing step needs.

Diez
Oct 4 '06 #7
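
One way to follow the advice above, sketched under the assumption of Python 3: replay the SAX events into `xml.sax.saxutils.XMLGenerator`, which writes the XML declaration and the output bytes in whichever encoding is requested. `reencode` is a hypothetical helper name, and the sample document is made up for the illustration.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


def reencode(data, encoding):
    """Parse XML bytes and serialize them again in the requested encoding."""
    out = io.BytesIO()
    # XMLGenerator is itself a ContentHandler, so it can be handed
    # straight to the parser and will write what it is told about.
    xml.sax.parseString(data, XMLGenerator(out, encoding=encoding))
    return out.getvalue()


latin1_doc = (
    '<?xml version="1.0" encoding="iso-8859-1"?><t>caf\u00e9</t>'
).encode("iso-8859-1")

utf8_doc = reencode(latin1_doc, "utf-8")
print(utf8_doc)
```

The output encoding here is chosen by the caller and is independent of the encoding declared in the input, which is the separation the reply argues for.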
> The encoding _is_ irrelevant, in the very moment you get unicode strings.

We shall have to disagree about this. My use case is perfectly reasonable,
imo.

> If you write out xml again, use whatever encoding suits you best.

What suits me best is what the *user* specified, and that got put in the
first xml line.
I'm going to have to parse this line myself.

Edward
Oct 4 '06 #8
"Edward K. Ream" <ed*******@charter.net> writes:
> Can anyone tell me how the content handler can determine the encoding of the
> file? Can sax provide this info?

Try this:

<code>
from xml.parsers import expat

s = """<?xml version='1.0' encoding='iso-8859-1'?>
<book>
<title>Title</title>
<chapter>Chapter 1</chapter>
</book>
"""

class MyParser(object):
    def XmlDecl(self, version, encoding, standalone):
        # expat calls this with the values found in the <?xml ...?> declaration.
        print("XmlDecl", version, encoding, standalone)

    def Parse(self, data):
        Parser = expat.ParserCreate()
        Parser.XmlDeclHandler = self.XmlDecl
        # The second argument marks this as the final chunk of input.
        Parser.Parse(data, 1)

parser = MyParser()
parser.Parse(s)
</code>

--
HTH,
Rob
Oct 4 '06 #9
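
Building on the expat approach in the post above, here is a small sketch (assuming Python 3) that returns the declared encoding instead of printing it, so an application could remember it and reuse it when writing the file back out. `declared_encoding` is a hypothetical helper name, not something from the thread or the library.

```python
from xml.parsers import expat


def declared_encoding(data):
    """Return the encoding named in the XML declaration of data (bytes).

    Returns None when the document has no declaration, or when the
    declaration does not name an encoding.
    """
    found = {}

    def on_xml_decl(version, encoding, standalone):
        # Called by expat with the values from the <?xml ...?> declaration.
        found["encoding"] = encoding

    parser = expat.ParserCreate()
    parser.XmlDeclHandler = on_xml_decl
    parser.Parse(data, True)  # True: this is the final chunk of input
    return found.get("encoding")


with_decl = (
    b"<?xml version='1.0' encoding='iso-8859-1'?>"
    b"<book><title>Title</title></book>"
)
without_decl = b"<book><title>Title</title></book>"

print(declared_encoding(with_decl))
print(declared_encoding(without_decl))
```

This parses the whole document a second time just to read the declaration, which is fine for small files; for large ones a different strategy would be worth considering.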
Edward K. Ream wrote:
> What suits me best is what the *user* specified, and that got put in the
> first xml line.
> I'm going to have to parse this line myself.

Please consider adding some elements to the document itself that
describe the desired output format, such as:

....
<output>
<encoding>utf-8</encoding>
</output>
....

This allows the client to specify the encoding it wants to receive
the document in, even if it's different than the encoding it used
to make the first document. More flexibility. Less fooling around.

--Irmen
Oct 4 '06 #10
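
A rough sketch of how a SAX content handler might pick up such an element, assuming Python 3. The handler name `OutputEncodingHandler`, the root element `doc`, and the `body` element are all made up for the illustration; only the `<output><encoding>` structure comes from the suggestion above.

```python
import xml.sax


class OutputEncodingHandler(xml.sax.ContentHandler):
    """Hypothetical handler that records the text of <output><encoding>."""

    def __init__(self):
        super().__init__()
        self.path = []               # stack of currently open element names
        self.output_encoding = None
        self._buffer = []

    def _in_encoding_element(self):
        return self.path[-2:] == ["output", "encoding"]

    def startElement(self, name, attrs):
        self.path.append(name)
        if self._in_encoding_element():
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several pieces, so collect and join later.
        if self._in_encoding_element():
            self._buffer.append(content)

    def endElement(self, name):
        if self._in_encoding_element():
            self.output_encoding = "".join(self._buffer).strip()
        self.path.pop()


doc = (
    b"<doc><output><encoding>utf-8</encoding></output>"
    b"<body>text</body></doc>"
)
handler = OutputEncodingHandler()
xml.sax.parseString(doc, handler)
print(handler.output_encoding)
```

Because the desired encoding is ordinary document content here, it survives parsing like any other data and does not depend on how the file happened to be serialized.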
