
Handling some isolated iso-8859-1 characters

I'm working on an app that's processing Usenet messages. I'm making a
connection to my NNTP feed and grabbing the headers for the groups I'm
interested in, saving the info to disk, and doing some post-processing.
I'm finding a few bizarre characters and I'm not sure how to handle them
pythonically.

One of the lines I'm finding this problem with contains:
137050 Cleo and I have an anouncement! "Mlle. =?iso-8859-1?Q?Ana=EFs?="
<no*@aol.com> Sun, 21 Nov 2004 16:21:50 -0500
<lm***************************@40tude.net> 4478 69 Xref:
sn-us rec.pets.cats.community:137050

The interesting patch is the string that reads "=?iso-8859-1?Q?Ana=EFs?=".
An HTML rendering of what this string should look like would be "Ana&iuml;s".

What I'm doing now is a brute-force substitution from the version in the
file to the HTML version. That's ugly. What's a better way to translate
that string? Or is my problem that I'm grabbing the headers from the NNTP
server incorrectly?

Jun 27 '08 #1
5 Replies


On Jun 4, 2:38 am, Daniel Mahoney <d...@catfolks.net> wrote:
> I'm working on an app that's processing Usenet messages. I'm making a
> connection to my NNTP feed and grabbing the headers for the groups I'm
> interested in, saving the info to disk, and doing some post-processing.
> I'm finding a few bizarre characters and I'm not sure how to handle them
> pythonically.
>
> One of the lines I'm finding this problem with contains:
> 137050 Cleo and I have an anouncement! "Mlle. =?iso-8859-1?Q?Ana=EFs?="
> <n...@aol.com> Sun, 21 Nov 2004 16:21:50 -0500
> <lmzdkqmqt2fj.54wmpv3zmvvx....@40tude.net> 4478 69 Xref:
> sn-us rec.pets.cats.community:137050
>
> The interesting patch is the string that reads "=?iso-8859-1?Q?Ana=EFs?=".
> An HTML rendering of what this string should look like would be "Ana&iuml;s".
>
> What I'm doing now is a brute-force substitution from the version in the
> file to the HTML version. That's ugly. What's a better way to translate
> that string? Or is my problem that I'm grabbing the headers from the NNTP
> server incorrectly?

>>> from email.Header import decode_header
>>> decode_header("=?iso-8859-1?Q?Ana=EFs?=")
[('Ana\xefs', 'iso-8859-1')]
>>> (s, e), = decode_header("=?iso-8859-1?Q?Ana=EFs?=")
>>> s
'Ana\xefs'
>>> e
'iso-8859-1'
>>> s.decode(e)
u'Ana\xefs'
>>> import unicodedata
>>> import htmlentitydefs
>>> for c in s.decode(e):
...     print ord(c), unicodedata.name(c)
...
65 LATIN CAPITAL LETTER A
110 LATIN SMALL LETTER N
97 LATIN SMALL LETTER A
239 LATIN SMALL LETTER I WITH DIAERESIS
115 LATIN SMALL LETTER S
>>> htmlentitydefs.codepoint2name[239]
'iuml'
>>>
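
The session above is Python 2. Below is a minimal sketch of the same steps for Python 3, assuming the goal is an HTML-safe string. In Python 3, `email.header.decode_header` returns bytes for encoded words, and `htmlentitydefs` has become `html.entities`. The helper name `header_to_html` is made up for illustration and is not part of the standard library. The sketch also assumes the charset named in the header is one Python knows about.

```python
import html
from email.header import decode_header
from html.entities import codepoint2name


def header_to_html(raw):
    """Decode an RFC 2047 header value and return HTML-safe ASCII text.

    Hypothetical helper: the name and the fallback choices are assumptions.
    """
    pieces = []
    for chunk, charset in decode_header(raw):
        # Encoded words come back as bytes; a header with no encoded
        # words comes back as a single str chunk with charset None.
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "ascii", errors="replace")
        pieces.append(chunk)
    text = "".join(pieces)

    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            # Escape markup-significant ASCII such as & and <.
            out.append(html.escape(ch))
        elif cp in codepoint2name:
            # Named entity, e.g. 239 -> &iuml;
            out.append("&%s;" % codepoint2name[cp])
        else:
            # No named entity available: fall back to a numeric reference.
            out.append("&#%d;" % cp)
    return "".join(out)


print(header_to_html("=?iso-8859-1?Q?Ana=EFs?="))  # Ana&iuml;s
```

If numeric references are acceptable, `text.encode("ascii", "xmlcharrefreplace")` does the non-ASCII part in one call and yields `&#239;` instead of `&iuml;`.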
Jun 27 '08 #2

On Tue, 03 Jun 2008 15:38:09 -0300, Daniel Mahoney <da*@catfolks.net>
wrote:
> I'm working on an app that's processing Usenet messages. I'm making a
> connection to my NNTP feed and grabbing the headers for the groups I'm
> interested in, saving the info to disk, and doing some post-processing.
> I'm finding a few bizarre characters and I'm not sure how to handle them
> pythonically.
>
> One of the lines I'm finding this problem with contains:
> 137050 Cleo and I have an anouncement! "Mlle.
> =?iso-8859-1?Q?Ana=EFs?="
> <no*@aol.com> Sun, 21 Nov 2004 16:21:50 -0500
> <lm***************************@40tude.net> 4478 69 Xref:
> sn-us rec.pets.cats.community:137050
>
> The interesting patch is the string that reads
> "=?iso-8859-1?Q?Ana=EFs?=".
> An HTML rendering of what this string should look like would be "Ana&iuml;s".
>
> What I'm doing now is a brute-force substitution from the version in the
> file to the HTML version. That's ugly. What's a better way to translate
> that string? Or is my problem that I'm grabbing the headers from the NNTP
> server incorrectly?

No, it's not you; those headers are formatted following RFC 2047
<http://www.faqs.org/ftp/rfc/rfc2047.txt>
Python already has support for that format: use the email.header module,
see <http://docs.python.org/lib/module-email.header.html>

--
Gabriel Genellina
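
For readers on Python 3, a short sketch of one way to use that module: `decode_header` splits the value into (bytes, charset) pairs, and `make_header` reassembles them into a `Header` object whose `str()` is the decoded text. The wrapper name `decode_rfc2047` is an assumption for illustration only.

```python
from email.header import decode_header, make_header


def decode_rfc2047(raw):
    """Return a header value with RFC 2047 encoded-words decoded to text.

    Hypothetical wrapper around two standard-library calls.
    """
    return str(make_header(decode_header(raw)))


print(decode_rfc2047("=?iso-8859-1?Q?Ana=EFs?="))
print(decode_rfc2047("Cleo and I have an anouncement!"))
```

Values without any encoded-word pass through unchanged, so the same call can be applied to every Subject and From field read from the overview data.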

Jun 27 '08 #3

> No, it's not you; those headers are formatted following RFC 2047
> <http://www.faqs.org/ftp/rfc/rfc2047.txt>
> Python already has support for that format: use the email.header module,
> see <http://docs.python.org/lib/module-email.header.html>

Excellent, that's exactly what I was looking for. Thanks!

Jun 27 '08 #4

> ...     print ord(c), unicodedata.name(c)
> ...
> 65 LATIN CAPITAL LETTER A
> 110 LATIN SMALL LETTER N
> 97 LATIN SMALL LETTER A
> 239 LATIN SMALL LETTER I WITH DIAERESIS
> 115 LATIN SMALL LETTER S

Looks like I need to explore the unicodedata module. Thanks!
Jun 27 '08 #5

Daniel Mahoney wrote:
> The interesting patch is the string that reads "=?iso-8859-1?Q?Ana=EFs?=".
> An HTML rendering of what this string should look like would be "Ana&iuml;s".

Email headers and Unicode are covered at the end of this article:

http://mxm-mad-science.blogspot.com/...school-of.html

--

hilsen/regards Max M, Denmark

http://www.mxm.dk/
IT's Mad Science

Jun 27 '08 #6
