Unicode problem

Hi to all, I have a little problem with Unicode handling under Python.

I have this code:

import codecs

s = u'A unicode string with this damn apostrophe \u2019'

outf = codecs.open('filename.txt', 'w', 'iso-8859-15')
outf.write(s)

What I get is a UnicodeEncodeError telling me that the character \u2019
maps to undefined.

But the character \u2019 is the apostrophe, and in the Unicode table it has
\u0027 as an equivalent, so the codec should convert \u2019 to \x27 (as
defined in ISO-8859-15)...

The problem is that my software deals with Italian strings that contain many
apostrophes and other similar symbols mapped between U+2000 and U+206F.

How can I resolve this issue? Should I preprocess the Unicode strings, or
is there a way to instruct Python to do the conversion?
Jul 7 '07 #1
On Sat, 07 Jul 2007 16:06:03 +0000, pa******@giochinternet.com wrote:
> But the character \u2019 is the apostrophe, and in the Unicode table it has
> \u0027 as an equivalent, so the codec should convert \u2019 to \x27 (as
> defined in ISO-8859-15)...

No, it shouldn't, because \u2019 is a "right single quotation mark" and not
an apostrophe.

Ciao,
Marc 'BlackJack' Rintsch
Jul 7 '07 #2
> No, it shouldn't, because \u2019 is a "right single quotation mark" and not
> an apostrophe.

I agree, but the problem is more subtle. I have converted a text from
iso-8859-1 to utf-8, and the codecs have translated \x27 (the ISO
apostrophe) to \xe2\x80\x99 in utf-8 (or U+2019 in Unicode code point
notation).

So if the codecs convert an apostrophe to a "right single quotation mark",
why not translate the "right single quotation mark" back to an apostrophe?

As far as I can see, it works in one direction but not in the other.
Jul 7 '07 #3
> I agree, but the problem is more subtle. I have converted a text from
> iso-8859-1 to utf-8, and the codecs have translated \x27 (the ISO
> apostrophe) to \xe2\x80\x99 in utf-8 (or U+2019 in Unicode code point
> notation).

What software did you use to make that so? The Python codec certainly
never would do such a thing.

Are you sure it was latin-1 and \x27, and not windows-1252 and \x92?

Regards,
Martin
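
The distinction Martin raises can be checked directly. The following is a minimal sketch, assuming Python 3 (the thread itself predates it); the variable names are illustrative only. It decodes the two candidate bytes and shows which one leads to the UTF-8 sequence reported above.

```python
# Byte 0x27 decoded as ISO-8859-1 is the plain APOSTROPHE, U+0027.
latin1_apostrophe = b"\x27".decode("iso-8859-1")

# Byte 0x92 decoded as Windows-1252 is RIGHT SINGLE QUOTATION MARK, U+2019.
cp1252_quote = b"\x92".decode("cp1252")

print(hex(ord(latin1_apostrophe)))   # 0x27
print(hex(ord(cp1252_quote)))        # 0x2019
print(cp1252_quote.encode("utf-8"))  # b'\xe2\x80\x99'

# The same byte 0x92 decoded as ISO-8859-1 is a C1 control character,
# not a quotation mark.
mislabeled = b"\x92".decode("iso-8859-1")
print(hex(ord(mislabeled)))          # 0x92
```

Only the Windows-1252 reading of \x92 produces U+2019, so a UTF-8 file containing \xe2\x80\x99 points to a Windows-1252 source rather than to a codec that rewrote \x27.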
Jul 7 '07 #4
pa******@giochinternet.com <pa******@giochinternet.com> wrote:
...
Ah, I answered you on the Italian NG before seeing you had also posted
the same request here. What I proposed there was (untested):

import codecs

# Characters the target codec cannot encode, mapped to stand-ins.
_rimedi = {u'\u2019': u"'"}

def rimedia(exc):
    if isinstance(exc, (UnicodeEncodeError, UnicodeTranslateError)):
        erore = exc.object[exc.start:exc.end]
        if len(erore) == 1 and erore in _rimedi:
            # A handler must return (replacement, position to resume at).
            return _rimedi[erore], exc.end
    raise exc

codecs.register_error('rimedia', rimedia)

outf = codecs.open('filename.txt', 'w', 'iso-8859-15', errors='rimedia')
Alex
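
A handler of this kind can be tried in memory before it is wired to a file. This is a sketch under assumptions, not Alex's code: it targets Python 3, and the names `FALLBACKS` and `ascii_fallback` are hypothetical. It also accepts a run of several unencodable characters, since the encoder may report consecutive failures as a single error span.

```python
import codecs

# Hypothetical fallback table; extend it to match the data at hand.
FALLBACKS = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}

def ascii_fallback(exc):
    """Replace unencodable characters listed in FALLBACKS; re-raise otherwise."""
    if isinstance(exc, UnicodeEncodeError):
        bad = exc.object[exc.start:exc.end]
        if all(ch in FALLBACKS for ch in bad):
            replacement = "".join(FALLBACKS[ch] for ch in bad)
            # Return the replacement and the index where encoding resumes.
            return replacement, exc.end
    raise exc

codecs.register_error("ascii_fallback", ascii_fallback)

encoded = "L\u2019albero".encode("iso-8859-15", errors="ascii_fallback")
print(encoded)  # b"L'albero"
```

Characters missing from the table still raise UnicodeEncodeError, so unexpected input is not silently altered.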
Jul 7 '07 #5
pa******@giochinternet.com wrote:
> But the character \u2019 is the apostrophe, and in the Unicode table it has
> \u0027 as an equivalent, so the codec should convert \u2019 to \x27 (as
> defined in ISO-8859-15)...

U+2019 is RIGHT SINGLE QUOTATION MARK. The APOSTROPHE (U+0027) is a
cross-reference as a similar code point, but they're not the same thing.

Your problem is that ISO-8859-15 doesn't have the RIGHT SINGLE QUOTATION
MARK, so you'll have to do the translation yourself if you want to turn
it into a true APOSTROPHE.

--
Erik Max Francis && ma*@alcyone.com && http://www.alcyone.com/max/
San Jose, CA, USA && 37 20 N 121 53 W && AIM, Y!M erikmaxfrancis
She glanced at her watch ... It was 9:23.
-- James Clavell
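
One way to "do the translation yourself", as Erik puts it, is to preprocess the text with a translation table before encoding. The sketch below assumes Python 3 (`str.maketrans` with a dict); the table `PUNCTUATION_FALLBACKS` and the function `to_latin9` are hypothetical names, and the choice of stand-ins is only an example.

```python
# Hypothetical mapping from typographic punctuation to ASCII stand-ins.
PUNCTUATION_FALLBACKS = str.maketrans({
    "\u2018": "'",    # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",    # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',    # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',    # RIGHT DOUBLE QUOTATION MARK
    "\u2013": "-",    # EN DASH
    "\u2026": "...",  # HORIZONTAL ELLIPSIS
})

def to_latin9(text):
    """Translate typographic punctuation, then encode as ISO-8859-15."""
    return text.translate(PUNCTUATION_FALLBACKS).encode("iso-8859-15")

sample = "L\u2019universit\u00e0 costa 10 \u20ac\u2026"
print(to_latin9(sample))  # b"L'universit\xe0 costa 10 \xa4..."
```

Characters that ISO-8859-15 does cover, such as the accented vowels and the euro sign, pass through untouched; only the listed punctuation is rewritten.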
Jul 7 '07 #6
> What software did you use to make that so? The Python codec certainly
> never would do such a thing.
>
> Are you sure it was latin-1 and \x27, and not windows-1252 and \x92?

You're right. The source texts are HTML pages, and evidently the webmasters
have poor knowledge of encodings: the meta tag declares the encoding as
ISO-8859-1, but the real encoding is Windows-1252, and yes, it uses \x92 as
the apostrophe. So the problem isn't Python.
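
For pages mislabeled this way, one possible remedy is to override the declared label at decode time, which is also how web browsers treat an ISO-8859-1 declaration. A minimal sketch, assuming Python 3; `LABEL_OVERRIDES` and `decode_page` are hypothetical names, not part of any library.

```python
# Hypothetical override table: labels that in practice usually mean cp1252.
LABEL_OVERRIDES = {
    "iso-8859-1": "cp1252",
    "latin-1": "cp1252",
    "latin1": "cp1252",
}

def decode_page(raw_bytes, declared_encoding):
    """Decode page bytes, reading a declared ISO-8859-1 as Windows-1252."""
    label = declared_encoding.strip().lower()
    codec = LABEL_OVERRIDES.get(label, label)
    # Bytes that are undefined in the chosen codec become U+FFFD.
    return raw_bytes.decode(codec, errors="replace")

page = b"<p>L\x92albero</p>"
text = decode_page(page, "ISO-8859-1")
print(ascii(text))  # '<p>L\u2019albero</p>'
```

The decoded text then contains U+2019, which still needs one of the substitutions discussed earlier in the thread before it can be written as ISO-8859-15.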
Jul 8 '07 #7
