Decoding 'funky' e-mail subjects

Jonas Galvez
#1: Jul 18 '05
Hi, I need a function to parse badly encoded 'Subject' headers from
e-mails, such as the following:

=?ISO-8859-1?Q?Murilo_Corr=EAa?=
=?ISO-8859-1?Q?Marcos_Mendon=E7a?=

I tried using the decode() method from mimetools, but that doesn't
appear to be the correct solution. I ended up coding the following:

import re

subject = "=?ISO-8859-1?Q?Murilo_Corr=EAa?="
subject = re.search("(?:=\?[^\?]*\?\Q\?)?(.*)\?=", subject)
subject = subject.group(1)

def decodeEntity(str):
    str = str.group(1)
    try: return eval('"\\x%s"' % str)
    except: return "?"

subject = re.sub("=([^=].)", decodeEntity, subject)
print subject.replace("_", " ").decode("iso-8859-1")

Can anyone recommend a safer method?

Tia,



\\ jonas galvez
// jonasgalvez.com








Oliver Kurz
#2: Jul 18 '05

re: Decoding 'funky' e-mail subjects


Have you tried decode_header from email.Header in the python email-package?



Best regards,

Oliver
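
For readers on a current interpreter, here is a minimal sketch of what Oliver's suggestion might look like. It assumes Python 3, where the module is spelled email.header (lowercase), and decode_subject is a hypothetical helper name, not something from the thread. decode_header returns a list of (payload, charset) pairs that still have to be decoded and joined:

```python
from email.header import decode_header


def decode_subject(raw):
    """Decode a raw Subject header into text (hypothetical helper)."""
    parts = []
    for payload, charset in decode_header(raw):
        if isinstance(payload, bytes):
            # Encoded words come back as bytes plus the declared charset;
            # unencoded fragments in a mixed header have charset None.
            parts.append(payload.decode(charset or "ascii", errors="replace"))
        else:
            # A header with no encoded words is returned as a plain str.
            parts.append(payload)
    return "".join(parts)


print(decode_subject("=?ISO-8859-1?Q?Murilo_Corr=EAa?="))
print(decode_subject("=?ISO-8859-1?Q?Marcos_Mendon=E7a?="))
```

A charset name that Python does not recognise would raise LookupError in this sketch; a real application would probably want to catch that and fall back to a default.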


Jonas Galvez
#3: Jul 18 '05

re: Decoding 'funky' e-mail subjects


Oliver Kurz wrote:
> Have you tried decode_header from email.Header
> in the python email-package?

Thanks, that works. The problem is that I need to make it compatible
with Python 1.5.2. I improved my regex-based method and it has worked
fine with all my test cases so far. But if anyone has any other
suggestion, I'm still interested. Anyway, here's my code:

import re
from string import *

def decodeHeader(h):
    def firstGroup(s):
        if s.group(1): return s.group(1)
        return s.group()
    h = re.compile("=\?[^\?]*\?q\?", re.I).sub("", h)
    h = re.compile(
        "=\?(?:(?:(?:(?:(?:(?:(?:(?:w)?i)?n)?d)?o)?w)?s)?| "
        "(?:(?:(?:i)?s)?o)?|(?:(?:(?:u)?t)?f)?)"
        "[^\.]*?(\.\.\.)?$",
        re.I).sub(firstGroup, h)
    h = re.sub("=.(\.\.\.)?$", firstGroup, h)
    def isoEntities(str):
        str = str.group(1)
        try: return eval('"\\x%s"' % str)
        except: return "?"
    h = re.sub("=([^=].)", isoEntities, h)
    if h[-2:] == "?=": h = h[:-2]
    return replace(h, "_", " ")

print decodeHeader("=?ISO-8859-1?Q?Marcos_Mendon=E7a?=")
print decodeHeader("=?ISO-8859-1?Q?Test?=")
print decodeHeader("=?UTF-8?Q?Test?=")
print decodeHeader("Test =?windows-125...")
print decodeHeader("Test =?window-125...")
print decodeHeader("Test =?windo-1...")
print decodeHeader("Test =?wind...")
print decodeHeader("Test =?...")
print decodeHeader("Test =?w...")
print decodeHeader("Test =?iso...")




\\ jonas galvez
// jonasgalvez.com





Skip Montanaro
#4: Jul 18 '05

re: Decoding 'funky' e-mail subjects


>> Have you tried decode_header from email.Header in the python
>> email-package?

Jonas> Thanks, that works. The problem is that I need to make it
Jonas> compatible with Python 1.5.2.

Why not just include email.Header.decode_header() in your app? Something
like:

try:
    from email.Header import decode_header
except ImportError:
    # Python 1.5.2 compatibility...
    def decode_header(...):
        ...

If that proves to be intractable, define yours when an ImportError is
raised. In either case, you get the best solution when you can and only
fall back to something possibly suboptimal when necessary.

Skip
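
A runnable version of the pattern Skip describes could look like the sketch below. It assumes the Python 3 spelling of the module (email.header), and the fallback body is a hypothetical placeholder that simply passes the header through undecoded; the thread itself leaves the real fallback elided.

```python
try:
    # Preferred: the standard library decoder, when the interpreter ships it.
    from email.header import decode_header
except ImportError:
    # Hypothetical stand-in for an old interpreter: return the header
    # undecoded, using the same list-of-(payload, charset) shape so that
    # calling code does not need to care which version it got.
    def decode_header(header):
        return [(header, None)]

pairs = decode_header("=?UTF-8?Q?Test?=")
print(pairs)
```

Keeping the fallback's return shape identical to the library function is the design choice that matters here: the rest of the application calls decode_header() the same way in both cases.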

Paul Rubin
#5: Jul 18 '05

re: Decoding 'funky' e-mail subjects


"Jonas Galvez" <jg@jonasgalvez.com> writes:
> Thanks, that works. The problem is that I need to make it compatible
> with Python 1.5.2. I improved my regex-based method and it has worked
> fine with all my test cases so far. But if anyone has any other
> suggestion, I'm still interested. Anyway, here's my code:

A lot of those funny subjects come from spammers. Never eval anything
from anyone like that!!!

Christos TZOTZIOY Georgiou
#6: Jul 18 '05

re: Decoding 'funky' e-mail subjects


On 07 Jun 2004 17:20:02 -0700, rumours say that Paul Rubin
<http://phr.cx@NOSPAM.invalid> might have written:
>"Jonas Galvez" <jg@jonasgalvez.com> writes:
>> Thanks, that works. The problem is that I need to make it compatible
>> with Python 1.5.2. I improved my regex-based method and it has worked
>> fine with all my test cases so far. But if anyone has any other
>> suggestion, I'm still interested. Anyway, here's my code:

>A lot of those funny subjects come from spammers. Never eval anything
>from anyone like that!!!

(The part of the code that caused Paul's comment):

    try: return eval('"\\x%s"' % str)
    except: return "?"

Sound advice from Paul. However, lots of those funny subjects come in
legitimate e-mails from countries where the ASCII range is not enough.

So, a safer alternative to the code above is:

    try: return chr(string.atoi(str, 16))
    except: return '?'
    # int(s, base) was not available in 1.5.2
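
On a current Python 3 interpreter the same idea can be written without eval() at all. The sketch below is an illustration under stated assumptions, not code from the thread: decode_q_text and its charset parameter are hypothetical names, and it only handles the =XX and underscore conventions discussed above.

```python
import re

# Only "=" followed by exactly two hex digits counts as an escape, so
# nothing taken from the message is ever evaluated as code.
_HEX_ESCAPE = re.compile(r"=([0-9A-Fa-f]{2})")


def decode_q_text(text, charset="iso-8859-1"):
    """Decode the payload of a Q-encoded word without eval() (sketch)."""
    text = text.replace("_", " ")
    # Turn each =XX escape into the character with that code point.
    unescaped = _HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # latin-1 maps code points 0-255 to the same byte values, which gives
    # back the raw bytes; those are then decoded with the declared charset.
    raw = unescaped.encode("latin-1", errors="replace")
    return raw.decode(charset, errors="replace")


print(decode_q_text("Murilo_Corr=EAa"))
print(decode_q_text("caf=C3=A9", charset="utf-8"))
```

Malformed escapes such as "=ZZ" are left untouched rather than replaced, since the pattern simply does not match them.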
--
TZOTZIOY, I speak England very best,
"I have a cunning plan, m'lord" --Sean Bean as Odysseus/Ulysses
Jonas Galvez
#7: Jul 18 '05

re: Decoding 'funky' e-mail subjects


Paul Rubin wrote:
> A lot of those funny subjects come from spammers. Never eval
> anything from anyone like that!!!

Hi Paul, yeah, actually, that kind of 'funky' subject is very common
on mailing-lists here in Brazil (where ISO-8859-1 is the standard). A
lot of people use crappy webmail software which spills out that kind
of mess. So I'm forced to deal with it :-)

By the way, this is for a mail2rss application which will enable easy
removal/blacklisting of spam, among other things.

Christos TZOTZIOY Georgiou wrote:
> Sound advice from Paul. However, lots of those funny subjects come
> in legitimate e-mails from countries where the ASCII range is not
> enough. So, a safer alternative to the code above is:
>
>     try: return chr(string.atoi(str, 16))
>     except: return '?'
>     # int(s, base) was not available in 1.5.2

Thanks! Yeah, I tried using int(str, base) on Python 1.5.2, and I was
too lazy to look for the alternative when I was able to do that quick
and dirty eval() thingy :-)



\\ jonas galvez
// jonasgalvez.com





Michel Claveau/Hamster
#8: Jul 18 '05

re: Decoding 'funky' e-mail subjects


Hi!

You need to decode each portion of the subject delimited by =? ... ?=
separately, then join the results.
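
The block-by-block approach Michel describes could be sketched as follows on Python 3. The names ENCODED_WORD, decode_word and decode_blocks are hypothetical, and the sketch assumes well-formed encoded words whose charset Python knows about. Both the "Q" and "B" (base64) encodings are handled.

```python
import base64
import quopri
import re

# One encoded word: =?charset?encoding?payload?=
ENCODED_WORD = re.compile(r"=\?([^?]+)\?([QqBb])\?([^?]*)\?=")


def decode_word(match):
    """Decode a single =?...?= block into text."""
    charset, encoding, payload = match.groups()
    if encoding.upper() == "B":
        data = base64.b64decode(payload)
    else:
        # header=True makes "_" decode as a space, as in Q-encoding.
        data = quopri.decodestring(payload.encode("ascii"), header=True)
    return data.decode(charset, errors="replace")


def decode_blocks(subject):
    """Decode every encoded block and leave the surrounding text as-is."""
    # Whitespace that only separates two adjacent encoded words is dropped.
    subject = re.sub(r"(\?=)\s+(=\?)", r"\1\2", subject)
    return ENCODED_WORD.sub(decode_word, subject)


print(decode_blocks("=?ISO-8859-1?Q?Murilo_Corr=EAa?="))
print(decode_blocks("Re: =?UTF-8?B?w6k=?= ok"))
```

Because each block carries its own charset, a subject that mixes, say, ISO-8859-1 and UTF-8 blocks is decoded correctly piece by piece before the results are joined.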



*sorry for my poor english*



@-salutations
--
Michel Claveau
e-mail: http://cerbermail.com/?6J1TthIa8B



