not quite 1252

Anton Vredegoor

I'm trying to import text from an open office document (save as .sxw and
read the data from content.xml inside the sxw-archive using
elementtree and such tools).

The encoding that gives me the least problems seems to be cp1252,
however it's not completely perfect because there are still characters
in it like \93 or \94. Has anyone handled this before? I'd rather not
reinvent the wheel and start translating strings 'by hand'.

Anton

Apr 26 '06 #1


Fredrik Lundh

Anton Vredegoor wrote:

I'm trying to import text from an open office document (save as .sxw and
read the data from content.xml inside the sxw-archive using
elementtree and such tools).

The encoding that gives me the least problems seems to be cp1252,
however it's not completely perfect because there are still characters
in it like \93 or \94. Has anyone handled this before?

this might help:

http://effbot.org/zone/unicode-gremlins.htm

</F>

Apr 26 '06 #2

Anton Vredegoor

Fredrik Lundh wrote:

Anton Vredegoor wrote:
I'm trying to import text from an open office document (save as .sxw and
read the data from content.xml inside the sxw-archive using
elementtree and such tools).

The encoding that gives me the least problems seems to be cp1252,
however it's not completely perfect because there are still characters
in it like \93 or \94. Has anyone handled this before?

this might help:

http://effbot.org/zone/unicode-gremlins.htm

Thanks a lot! The code below not only made the strange chars go away,
but it also fixed the xml-parsing errors ... Maybe it's useful to
someone else too, use at own risk though.

Anton

from gremlins import kill_gremlins
from zipfile import ZipFile, ZIP_DEFLATED

def repair(infn, outfn):
    zin = ZipFile(infn, 'r', ZIP_DEFLATED)
    zout = ZipFile(outfn, 'w', ZIP_DEFLATED)
    for x in zin.namelist():
        data = zin.read(x)
        if x == 'contents.xml':
            zout.writestr(x, kill_gremlins(data).encode('cp1252'))
        else:
            zout.writestr(x, data)
    zout.close()

def test():
    infn = "xxxx.sxw"
    outfn = 'dg.sxw'
    repair(infn, outfn)

if __name__ == '__main__':
    test()

Apr 26 '06 #3

Martin v. Löwis

Anton Vredegoor wrote:

The encoding that gives me the least problems seems to be cp1252,
however it's not completely perfect because there are still characters
in it like \93 or \94. Has anyone handled this before? I'd rather not
reinvent the wheel and start translating strings 'by hand'.

Not sure I understand the question. If you process data in cp1252,
then \x93 and \x94 are legal characters, and the Python codec should
support them just fine.

Regards,
Martin

Apr 26 '06 #4
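
A minimal sketch of the point above, assuming Python 3 (the thread itself predates it): in cp1252 the bytes 0x93 and 0x94 are the left and right curly double quotes, so they decode and re-encode without loss. The sample string is made up for illustration.

```python
# Bytes 0x93 and 0x94 are the curly double quotes in cp1252.
raw = b"\x93quoted\x94"
text = raw.decode("cp1252")

# ascii() shows the decoded code points in escaped form.
print(ascii(text))

# Encoding back to cp1252 restores the original bytes.
round_trip = text.encode("cp1252")
print(round_trip == raw)
```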

Anton Vredegoor

Martin v. Löwis wrote:

Not sure I understand the question. If you process data in cp1252,
then \x93 and \x94 are legal characters, and the Python codec should
support them just fine.

Tell that to the guys from open-office.

Anton

Apr 26 '06 #5

Serge Orlov

Anton Vredegoor wrote:

I'm trying to import text from an open office document (save as .sxw and
read the data from content.xml inside the sxw-archive using
elementtree and such tools).

The encoding that gives me the least problems seems to be cp1252,
however it's not completely perfect because there are still characters
in it like \93 or \94. Has anyone handled this before? I'd rather not
reinvent the wheel and start translating strings 'by hand'.

I extracted content.xml from a test file and the header is:
<?xml version="1.0" encoding="UTF-8"?>

So any xml library should handle it just fine, without you trying to
guess the encoding.

Apr 26 '06 #6
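
As a rough illustration of that advice, assuming Python 3 and the standard-library ElementTree: the parser reads the encoding from the XML declaration, so no manual decoding is needed. The archive below is a tiny in-memory stand-in with made-up content, not a real .sxw file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Hypothetical stand-in for content.xml, declared and encoded as UTF-8.
xml_bytes = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<doc><p>\u201cquoted\u201d</p></doc>'
).encode("utf-8")

# Build a small zip archive in memory so the example is self-contained.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zout:
    zout.writestr("content.xml", xml_bytes)

# Read the member back as bytes and let the parser honour the declaration.
with zipfile.ZipFile(buf) as zin:
    root = ET.fromstring(zin.read("content.xml"))

para = root.find("p").text
print(ascii(para))
```

With a real document, the in-memory buffer would be replaced by the path to the .sxw file.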

Martin v. Löwis

Anton Vredegoor wrote:

Not sure I understand the question. If you process data in cp1252,
then \x93 and \x94 are legal characters, and the Python codec should
support them just fine.

Tell that to the guys from open-office.

Ok, I'll rephrase: Can you please explain your problem again, in
different words?

I thought you are trying to export data *from* open-office, and
your message seems to suggest (without actually saying so) that the
document contains \x93 and \x94 (you said "there are still characters in
it like \93 or \94").

So if that is the case: What is the problem then? If you interpret
the document as cp1252, and it contains \x93 and \x94, what is
it that you don't like about that? In yet other words: what actions
are you performing, what are the results you expect to get, and
what are the results that you actually get?

Regards,
Martin

Apr 27 '06 #7

John Machin

On 27/04/2006 12:49 AM, Anton Vredegoor wrote:

Fredrik Lundh wrote:
Anton Vredegoor wrote:
I'm trying to import text from an open office document (save as .sxw and
read the data from content.xml inside the sxw-archive using
elementtree and such tools).

The encoding that gives me the least problems seems to be cp1252,
however it's not completely perfect because there are still characters
in it like \93 or \94. Has anyone handled this before?
this might help:

http://effbot.org/zone/unicode-gremlins.htm

Thanks a lot! The code below not only made the strange chars go away,
but it also fixed the xml-parsing errors

What xml-parsing errors were they??

... Maybe it's useful to
someone else too, use at own risk though.

Anton

from gremlins import kill_gremlins
from zipfile import ZipFile, ZIP_DEFLATED

def repair(infn, outfn):
    zin = ZipFile(infn, 'r', ZIP_DEFLATED)
    zout = ZipFile(outfn, 'w', ZIP_DEFLATED)
    for x in zin.namelist():
        data = zin.read(x)
        if x == 'contents.xml':

Firstly, this should be 'content.xml', not 'contents.xml'.

Secondly, as pointed out by Sergei, the data is encoded by OOo as UTF-8
e.g. what is '\x94' in cp1252 is \u201d which is '\xe2\x80\x9d' in
UTF-8. The kill_gremlins function is intended to fix Unicode strings
that have been obtained by decoding 8-bit strings using 'latin1' instead
of 'cp1252'. When you pump '\xe2\x80\x9d' through the kill_gremlins
function, it changes the \x80 to a Euro symbol, and leaves the other two
alone. Because the \x9d is not defined in cp1252, it then causes your
code to die in a hole when you attempt to encode it as cp1252:

UnicodeEncodeError: 'charmap' codec can't encode character u'\x9d' in
position 1761: character maps to <undefined>

I don't see how this code repairs anything (quite the contrary!), unless
there's some side effect of just read/writestr. Enlightenment, please.

            zout.writestr(x, kill_gremlins(data).encode('cp1252'))
        else:
            zout.writestr(x, data)
    zout.close()

Apr 27 '06 #8
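
The failure described above can be reproduced in a few lines. This is only a sketch, assuming Python 3; the `replace` call is a stand-in for the single substitution mentioned in the post (U+0080 to the Euro sign), not the actual kill_gremlins code.

```python
# The right curly quote is three bytes in UTF-8.
right_quote = "\u201d"
utf8_bytes = right_quote.encode("utf-8")
print(utf8_bytes == b"\xe2\x80\x9d")

# Misreading those bytes as Latin-1 yields three separate characters.
misread = utf8_bytes.decode("latin-1")

# Stand-in for the gremlin fix: U+0080 becomes the Euro sign,
# the other two characters are left alone.
patched = misread.replace("\x80", "\u20ac")

# U+009D has no cp1252 mapping, so encoding fails.
try:
    patched.encode("cp1252")
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print(exc)

# Decoding as UTF-8 first gives the single cp1252 byte 0x94.
converted = utf8_bytes.decode("utf-8").encode("cp1252")
print(converted)
```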

Anton Vredegoor

John Machin wrote:

Firstly, this should be 'content.xml', not 'contents.xml'.

Right, the code doesn't do *anything* :-( Thanks for pointing that out.
At least it doesn't do much harm either :-|

Secondly, as pointed out by Sergei, the data is encoded by OOo as UTF-8
e.g. what is '\x94' in cp1252 is \u201d which is '\xe2\x80\x9d' in
UTF-8. The kill_gremlins function is intended to fix Unicode strings
that have been obtained by decoding 8-bit strings using 'latin1' instead
of 'cp1252'. When you pump '\xe2\x80\x9d' through the kill_gremlins
function, it changes the \x80 to a Euro symbol, and leaves the other two
alone. Because the \x9d is not defined in cp1252, it then causes your
code to die in a hole when you attempt to encode it as cp1252:
UnicodeEncodeError: 'charmap' codec can't encode character u'\x9d' in
position 1761: character maps to <undefined>

Yeah, converting to cp1252 was all that was necessary, like Sergei wrote.

I don't see how this code repairs anything (quite the contrary!), unless
there's some side effect of just read/writestr. Enlightenment, please.

You're quite right. I'm extremely embarrassed now. What's left for me is
just to explain how it got this bad.

First I noticed that by extracting from content.xml using OOopy's
getiterator function, some \x94 codes were left inside the document.

But that was an *artifact*, because if one prints something using
s.__repr__() as is used for example when printing a list of strings
(duh) the output is not the same as when one prints with 'print s'. I
guess what is called then is str(s).

Ok, now we have that out of the way, I hope.

So I immediately posted a message about conversion errors, assuming
something in the open office xml file was not quite 1252. In fact it
wasn't, it was UTF-8 like Sergei wrote, but it was easy to convert it to
cp1252, no problem.

Then I also noticed that not all xml-tags were printed if I just
iterated the xml-tree and filtered out only those elements with a text
attribute, like 'if x.text: print x'

In fact there are a lot of printable things that haven't got a text
attribute, for example some items with tag (xxxx)s.

When F pointed me to gremlins there was on this page the following text:

<quote>

Some applications add CP1252 (Windows, Western Europe) characters to
documents marked up as ISO 8859-1 (Latin 1) or other encodings. These
characters are not valid ISO-8859-1 characters, and may cause all sorts
of problems in processing and display applications.

</quote>

I concluded that these \x94 codes (which I didn't know about them being
a figment of my representation yet) were responsible for my iterator
skipping over some text elements, but in fact the iterator skipped them
because they had no text attribute even though they were somehow
containing text.

Now add my natural tendency to see that what I think is the case rather
than neutrally observing the world as it is into the mix and of course I
saw the \x94 disappear (but that was because I now was printing them
straight and not indirectly as elements of a list) and also I thought
that now the xml-parsing 'errors' had disappeared but that was just
because I saw some text element appear that I thought I hadn't seen
before (but in fact it was there all the time).

One man's enlightenment sometimes is another's embarrassment, or so it
seems. Thanks to you all clearing up my perceptions, and sorry about all
the confusion I created.

What I want to know next is how to access and print the elements that
contain text but have no text attribute, that is, if it's not too taxing
on my badly damaged ego.

Anton

Apr 27 '06 #9
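
One possible explanation for the "missing" text, offered as a sketch rather than a diagnosis of the actual file: in ElementTree, text that follows a child element's closing tag is stored in that child's `tail`, not in any `text` attribute. The fragment below is hypothetical (real OpenOffice markup uses namespaced tags), and the example assumes Python 3.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: <s/> is an empty element followed by text.
root = ET.fromstring("<p>one<s/>two<b>three</b>four</p>")

# The empty element has no text, but the text after it is its tail.
s = root.find("s")
print(repr(s.text), repr(s.tail))

# Filtering on .text alone misses "two" and "four".
texts_only = [el.text for el in root.iter() if el.text]
print(texts_only)

# itertext() walks text and tail in document order.
all_text = "".join(root.itertext())
print(all_text)
```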

Anton Vredegoor

Serge Orlov wrote:

I extracted content.xml from a test file and the header is:
<?xml version="1.0" encoding="UTF-8"?>

So any xml library should handle it just fine, without you trying to
guess the encoding.

Yes, my header also says UTF-8. However some kind person sent me an
e-mail stating that since I am getting \x94 and such output when using
repr (even if str is giving correct output) there could be some problem
with the XML-file not being completely UTF-8. Or is there some other
reason I'm getting these \x94 codes? Or maybe this is just as it should
be and there's no problem at all? Again?

Anton

'octopussies respond only off-list'

Apr 28 '06 #10
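
A small sketch of why escape codes can appear even when the data is fine, assuming Python 3, where a `bytes` object plays the role of the 8-bit string from the thread: printing a list shows the repr of each element, so a cp1252-encoded quote appears as an escape rather than as a character.

```python
# A right curly quote, converted to cp1252 as described in the thread.
quote_bytes = "\u201d".encode("cp1252")

# A list is displayed using the repr of its elements,
# so the byte appears as an escape code.
items = [quote_bytes]
shown_in_list = str(items)
print(shown_in_list)

# Decoding the same byte gives back the single quote character.
shown_alone = quote_bytes.decode("cp1252")
print(shown_alone == "\u201d")
```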
