Python and encodings drives me crazy

Oliver Andrich

Hi everybody,

I have to write a little script that reads some nasty XML-formatted
files. "Nasty XML-formatted" means we have an XML-like syntax, no DTD,
HTML entities used without declaration, and so on. A task as I like it.
My task looks like this:

1. read the data from the file.
2. get rid of the html entities
3. parse the stuff to extract the content of two tags.
4. create a new string from the extracted content
5. write it to a cp850 or even better macroman encoded file

Well, step 1 is easy and obvious. Step 2 is solved for me by

===== code =====

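import string
from htmlentitydefs import entitydefs

# Build (entity, replacement) pairs. Entities whose replacement itself
# starts with "&" are simply dropped (replaced by "").
html2text = []
for k, v in entitydefs.items():
    if v[0] != "&":
        html2text.append(["&" + k + ";", v])
    else:
        html2text.append(["&" + k + ";", ""])

def remove_html_entities(data):
    for html, char in html2text:
        data = apply(string.replace, [data, html, char])
    return data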

===== code =====
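
For readers on Python 3, where `htmlentitydefs` has become `html.entities`, step 2 could be sketched roughly as below. This is not the poster's code: the regular expression, the `XML_PREDEFINED` set, and the decision to leave XML's own entities untouched (so the result can still go through an XML parser) are assumptions made for illustration.

```python
import re
from html.entities import name2codepoint

# Assumption: these five are predefined in XML, so a parser handles them itself.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches named references such as "&uuml;"; numeric ones are left alone.
_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def remove_html_entities(text):
    """Replace named HTML entities in a text string with their characters."""
    def substitute(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # keep unknown or XML-native entities as-is
        return chr(name2codepoint[name])
    return _ENTITY.sub(substitute, text)
```

A single regex pass avoids one `replace` call per known entity, and it works on text (already decoded) rather than on raw bytes.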

Steps 3 and 4 also work fine so far. But step 5 drives me completely
crazy, because I get a lot of nice exceptions from the codecs module.

Hopefully someone can help me with that.

If my code for processing the file looks like that:

def process_file(file_name):
    data = codecs.open(file_name, "r", "latin1").read()
    data = remove_html_entities(data)
    dom = parseString(data)
    print data

I get

Traceback (most recent call last):
  File "ag2blvd.py", line 46, in ?
    process_file(file_name)
  File "ag2blvd.py", line 33, in process_file
    data = remove_html_entities(data)
  File "ag2blvd.py", line 39, in remove_html_entities
    data = apply(string.replace, [data, html, char])
  File "/usr/lib/python2.4/string.py", line 519, in replace
    return s.replace(old, new, maxsplit)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)

I am pretty sure that I have iso-latin-1 files, but after running
through my code everything looks pretty broken. If I remove the call
to remove_html_entities I get

Traceback (most recent call last):
  File "ag2blvd.py", line 46, in ?
    process_file(file_name)
  File "ag2blvd.py", line 35, in process_file
    print data
UnicodeEncodeError: 'ascii' codec can't encode character u'\x96' in position 2482: ordinal not in range(128)

And this continues, when I try to write to a file in macroman encoding.

As I am pretty sure that I am doing something completely wrong, and I
haven't found a hint in the fantastic cookbook either, I'd like to ask
for help here. :)

I am also pretty sure that I am doing something wrong, because writing
a unicode string with German umlauts to a macroman file opened via the
codecs module works fine.

Hopefully someone can help me. :)

Best regards,
Oliver

--
Oliver Andrich <ol************@gmail.com> --- http://fitheach.de/

Jul 19 '05 #1


Steven Bethard

Oliver Andrich wrote:

> def remove_html_entities(data):
>     for html, char in html2text:
>         data = apply(string.replace, [data, html, char])
>     return data

I know this isn't your question, but why write:

    data = apply(string.replace, [data, html, char])

when you could write

    data = data.replace(html, char)

??

STeVe

Jul 19 '05 #2

Oliver Andrich

> I know this isn't your question, but why write:
>
>     data = apply(string.replace, [data, html, char])
>
> when you could write
>
>     data = data.replace(html, char)
>
> ??

Because I guess I am already blind. Thanks.

Oliver

--
Oliver Andrich <ol************@gmail.com> --- http://fitheach.de/

Jul 19 '05 #3

Oliver Andrich

Well, I narrowed my problem down to writing a macroman or cp850 file
using the codecs module. The rest was basically a misunderstanding
about the codecs module and the wrong assumption that my input data is
iso-latin-1 encoded. It is UTF-8 encoded. So, currently I am at the
point where I have my data ready for writing.
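
A wrong guess about the input encoding, as described above, can be caught early by decoding strictly. The following is a minimal Python 3 sketch, not code from the thread; the function name `decode_input` and the Latin-1 fallback are assumptions for illustration.

```python
def decode_input(raw, fallback="latin-1"):
    """Decode bytes as UTF-8 if they are valid UTF-8, else use the fallback.

    Returns the decoded text together with the name of the codec that was used.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this cannot fail,
        # but it also cannot tell you whether the guess was right.
        return raw.decode(fallback), fallback
```

Trying UTF-8 first is the safer order: Latin-1 bytes containing umlauts are almost never valid UTF-8, whereas UTF-8 bytes always "decode" as Latin-1 and silently produce garbage.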

Does the following code write headline and caption in MacRoman
encoding to the disk? Or, better, is this the way to do it?
headline and caption are both unicode strings.

f = codecs.open(outfilename, "w", "macroman")
f.write(headline)
f.write("\n\n")
f.write(caption)
f.close()
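
For comparison, a rough Python 3 equivalent of the snippet above, using the built-in `open` with an `encoding` argument. The function name `write_macroman`, the `errors` parameter, and the scratch file name are illustrative assumptions, not part of the original script.

```python
import os
import tempfile


def write_macroman(path, headline, caption, errors="strict"):
    """Write headline and caption, separated by a blank line, as MacRoman."""
    # newline="\n" keeps line endings untranslated on every platform.
    with open(path, "w", encoding="mac_roman", errors=errors, newline="\n") as handle:
        handle.write(headline)
        handle.write("\n\n")
        handle.write(caption)


# Usage: write to a scratch directory and read the raw bytes back.
scratch = os.path.join(tempfile.mkdtemp(), "caption.txt")
write_macroman(scratch, "Gr\u00f6\u00dfe", "caption")
with open(scratch, "rb") as handle:
    written = handle.read()
```

With `errors="strict"` a character that MacRoman lacks raises `UnicodeEncodeError` at write time instead of being dropped silently.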

Best regards,
Oliver

--
Oliver Andrich <ol************@gmail.com> --- http://fitheach.de/

Jul 19 '05 #4

Diez B. Roggisch

Oliver Andrich wrote:

> Well, I narrowed my problem down to writing a macroman or cp850 file
> using the codecs module. The rest was basically a misunderstanding
> about the codecs module and the wrong assumption that my input data is
> iso-latin-1 encoded. It is UTF-8 encoded. So, currently I am at the
> point where I have my data ready for writing.
>
> Does the following code write headline and caption in MacRoman
> encoding to the disk? Or, better, is this the way to do it?
> headline and caption are both unicode strings.
>
> f = codecs.open(outfilename, "w", "macroman")
> f.write(headline)
> f.write("\n\n")
> f.write(caption)
> f.close()

Looks OK, but you should use u"\n\n" in general: if that line for some
reason changes to "öäü" (German umlauts), you'll get the error you
already observed. But when using u"äöü", the parser pukes at you if the
declared coding of the source file can't decode those bytes into the
unicode object.

Most problems occur when one confuses unicode objects with strings:
mixing them requires a coercion that is done using the default
encoding, which leads to the error you already observed.

Diez
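
The coercion described above can be made visible with a small Python 3 sketch. Python 3 no longer coerces implicitly, so the ASCII decode is written out explicitly here; the sample bytes and variable names are made up for illustration.

```python
raw = b"\xe5ngstr\xf6m"  # Latin-1 bytes, as read from a file in binary mode

# What the implicit coercion in Python 2 effectively attempted:
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = str(exc)

# The explicit alternative: decode once on input, work with text,
# and encode once on output.
text = raw.decode("latin-1")
encoded = text.encode("mac_roman")
```

Keeping bytes at the edges of the program and text everywhere in between means no step ever has to guess an encoding.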

Jul 19 '05 #5

Konstantin Veretennicov

On 6/20/05, Oliver Andrich <ol************@gmail.com> wrote:

> Does the following code write headline and caption in
> MacRoman encoding to the disk?
>
> f = codecs.open(outfilename, "w", "macroman")
> f.write(headline)

It does, as long as headline and caption *can* actually be encoded as
macroman. After you decode headline from utf-8 it will be unicode and
not all unicode characters can be mapped to macroman:

>>> u'\u0160'.encode('utf8')
'\xc5\xa0'
>>> u'\u0160'.encode('latin2')
'\xa9'
>>> u'\u0160'.encode('macroman')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "D:\python\2.4\lib\encodings\mac_roman.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u0160' in position 0: character maps to <undefined>

- kv
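
To find out in advance which characters of a string have no MacRoman mapping, one could test them one by one. A minimal Python 3 sketch, assuming a hypothetical helper name `unencodable`:

```python
def unencodable(text, encoding="mac_roman"):
    """Return the distinct characters of text that the codec cannot encode,
    in order of first appearance."""
    bad = []
    for char in text:
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            if char not in bad:
                bad.append(char)
    return bad
```

Running this over all input files before conversion gives a list of problem characters to decide about, rather than a traceback in the middle of a batch.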

Jul 19 '05 #6

Oliver Andrich

2005/6/21, Konstantin Veretennicov <kv***********@gmail.com>:

> It does, as long as headline and caption *can* actually be encoded as
> macroman. After you decode headline from utf-8 it will be unicode and
> not all unicode characters can be mapped to macroman:
>
> >>> u'\u0160'.encode('utf8')
> '\xc5\xa0'
> >>> u'\u0160'.encode('latin2')
> '\xa9'
> >>> u'\u0160'.encode('macroman')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
>   File "D:\python\2.4\lib\encodings\mac_roman.py", line 18, in encode
>     return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode character u'\u0160' in position 0: character maps to <undefined>

Yes, this and the coercion problems Diez mentioned were the problems I
faced. Now I have written a little cleanup method that removes the
bad characters from the input, and finally I guess I have macroman
encoded files. But we will see as soon as I try to open them on the
Mac. For now I am more or less satisfied, as only 3 files are
obviously not converted correctly and the other 1000 files are.

Thanks for your hints, tips and so on. Good Night.

Oliver

--
Oliver Andrich <ol************@gmail.com> --- http://fitheach.de/

Jul 19 '05 #7

John Machin

Oliver Andrich wrote:

> 2005/6/21, Konstantin Veretennicov <kv***********@gmail.com>:
>
> > It does, as long as headline and caption *can* actually be encoded as
> > macroman. After you decode headline from utf-8 it will be unicode and
> > not all unicode characters can be mapped to macroman:
> >
> > >>> u'\u0160'.encode('utf8')
> > '\xc5\xa0'
> > >>> u'\u0160'.encode('latin2')
> > '\xa9'
> > >>> u'\u0160'.encode('macroman')
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in ?
> >   File "D:\python\2.4\lib\encodings\mac_roman.py", line 18, in encode
> >     return codecs.charmap_encode(input,errors,encoding_map)
> > UnicodeEncodeError: 'charmap' codec can't encode character u'\u0160' in position 0: character maps to <undefined>
>
> Yes, this and the coercion problems Diez mentioned were the problems I
> faced. Now I have written a little cleanup method that removes the
> bad characters from the input

By "bad characters", do you mean characters that are in Unicode but not
in MacRoman?

By "removes the bad characters", do you mean "deletes", or do you mean
"substitutes one or more MacRoman characters"?

If all you want to do is torch the bad guys, you don't have to write "a
little cleanup method".

To leave a tombstone for the bad guys:

>>> u'abc\u0160def'.encode('macroman', 'replace')
'abc?def'

To leave no memorial, only a cognitive gap:

>>> u'The Good Soldier \u0160vejk'.encode('macroman', 'ignore')
'The Good Soldier vejk'
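
A third option, substituting a close MacRoman character instead of deleting, might look like the following Python 3 sketch. It is an assumption-laden illustration rather than anything from the thread: the name `encode_with_fallback` is hypothetical, and the approach (Unicode NFKD decomposition, keeping only the parts the codec accepts) only helps for characters that decompose into a base letter plus accents.

```python
import unicodedata


def encode_with_fallback(text, encoding="mac_roman"):
    """Encode text, approximating unencodable characters by their base letters."""
    pieces = []
    for char in text:
        try:
            # Characters the codec knows (including umlauts) pass through unchanged.
            pieces.append(char.encode(encoding))
        except UnicodeEncodeError:
            # Split e.g. S-with-caron into "S" plus a combining caron, then drop
            # whatever still cannot be encoded; use "?" if nothing is left.
            decomposed = unicodedata.normalize("NFKD", char)
            pieces.append(decomposed.encode(encoding, "ignore") or b"?")
    return b"".join(pieces)
```

Decomposing only the characters that fail keeps letters such as "ä", which MacRoman does contain, intact instead of stripping their accents too.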

Do you *really* need to encode it as MacRoman? Can't the Mac app
understand utf8?

You mentioned cp850 in an earlier post. What would you be feeding
cp850-encoded data that doesn't understand cp1252, and isn't in a museum?

Cheers,
John

Jul 19 '05 #8
