Decoding 'funky' e-mail subjects

Hi, I need a function to parse badly encoded 'Subject' headers from
e-mails, such as the following:

=?ISO-8859-1?Q?Murilo_Corr=EAa?=
=?ISO-8859-1?Q?Marcos_Mendon=E7a?=

I tried using the decode() method from mimetools, but that doesn't
appear to be the correct solution. I ended up coding the following:

import re

subject = "=?ISO-8859-1?Q?Murilo_Corr=EAa?="
subject = re.search("(?:=\?[^\?]*\?\Q\?)?(.*)\?=", subject)
subject = subject.group(1)

def decodeEntity(str):
    str = str.group(1)
    try: return eval('"\\x%s"' % str)
    except: return "?"

subject = re.sub("=([^=].)", decodeEntity, subject)
print subject.replace("_", " ").decode("iso-8859-1")

Can anyone recommend a safer method?

Tia,

\\ jonas galvez
// jonasgalvez.com



Jul 18 '05 #1
Have you tried decode_header from email.Header in the python email-package?
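
As a rough sketch of that suggestion on a current interpreter (an assumption: Python 3, where the module is spelled email.header in lowercase), decode_header returns a list of (payload, charset) pairs that still have to be decoded and joined. The helper name decode_subject is made up for this illustration and is not part of the email package.

```python
from email.header import decode_header

def decode_subject(raw):
    """Decode an RFC 2047 style header value into one text string.

    decode_subject is a hypothetical helper name used for illustration.
    """
    parts = []
    for payload, charset in decode_header(raw):
        if isinstance(payload, bytes):
            # Encoded words come back as bytes plus the declared charset.
            # An unknown charset name would raise LookupError here.
            parts.append(payload.decode(charset or "ascii", errors="replace"))
        else:
            # A header with no encoded words comes back as plain str.
            parts.append(payload)
    return "".join(parts)

print(decode_subject("=?ISO-8859-1?Q?Murilo_Corr=EAa?="))
```
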

Best regards,

Oliver
Jul 18 '05 #2
Oliver Kurz wrote:
> Have you tried decode_header from email.Header
> in the python email-package?


Thanks, that works. The problem is that I need to make it compatible
with Python 1.5.2. I improved my regex-based method and it has worked
fine with all my test cases so far. But if anyone has any other
suggestion, I'm still interested. Anyway, here's my code:

import re
from string import *

def decodeHeader(h):
    def firstGroup(s):
        if s.group(1): return s.group(1)
        return s.group()
    h = re.compile("=\?[^\?]*\?q\?", re.I).sub("", h)
    h = re.compile(
        "=\?(?:(?:(?:(?:(?:(?:(?:(?:w)?i)?n)?d)?o)?w)?s)?| "
        "(?:(?:(?:i)?s)?o)?|(?:(?:(?:u)?t)?f)?)"
        "[^\.]*?(\.\.\.)?$",
        re.I).sub(firstGroup, h)
    h = re.sub("=.(\.\.\.)?$", firstGroup, h)
    def isoEntities(str):
        str = str.group(1)
        try: return eval('"\\x%s"' % str)
        except: return "?"
    h = re.sub("=([^=].)", isoEntities, h)
    if h[-2:] == "?=": h = h[:-2]
    return replace(h, "_", " ")

print decodeHeader("=?ISO-8859-1?Q?Marcos_Mendon=E7a?=")
print decodeHeader("=?ISO-8859-1?Q?Test?=")
print decodeHeader("=?UTF-8?Q?Test?=")
print decodeHeader("Test =?windows-125...")
print decodeHeader("Test =?window-125...")
print decodeHeader("Test =?windo-1...")
print decodeHeader("Test =?wind...")
print decodeHeader("Test =?...")
print decodeHeader("Test =?w...")
print decodeHeader("Test =?iso...")


\\ jonas galvez
// jonasgalvez.com

Jul 18 '05 #3
> Have you tried decode_header from email.Header in the python
> email-package?


Jonas> Thanks, that works. The problem is that I need to make it
Jonas> compatible with Python 1.5.2.

Why not just include email.Header.decode_header() in your app? Something
like:

try:
    from email.Header import decode_header
except ImportError:
    # Python 1.5.2 compatibility...
    def decode_header(...):
        ...

If that proves to be intractable, define yours when an ImportError is
raised. In either case, you get the best solution when you can and only
fall back to something possibly suboptimal when necessary.

Skip

Jul 18 '05 #4
"Jonas Galvez" <jg@jonasgalvez.com> writes:
> Thanks, that works. The problem is that I need to make it compatible
> with Python 1.5.2. I improved my regex-based method and it has worked
> fine with all my test cases so far. But if anyone has any other
> suggestion, I'm still interested. Anyway, here's my code:


A lot of those funny subjects come from spammers. Never eval anything
from anyone like that!!!
Jul 18 '05 #5
On 07 Jun 2004 17:20:02 -0700, rumours say that Paul Rubin
<http://ph****@NOSPAM.invalid> might have written:
"Jonas Galvez" <jg@jonasgalvez.com> writes:
>> Thanks, that works. The problem is that I need to make it compatible
>> with Python 1.5.2. I improved my regex-based method and it has worked
>> fine with all my test cases so far. But if anyone has any other
>> suggestion, I'm still interested. Anyway, here's my code:
> A lot of those funny subjects come from spammers. Never eval anything
> from anyone like that!!!


(The part of the code that caused Paul's comment):

try: return eval('"\\x%s"' % str)
except: return "?"

Sound advice from Paul. However, lots of those funny subjects come in
legitimate e-mails from countries where the ASCII range is not enough.

So, a safer alternative to the code above is:

try: return chr(string.atoi(str, 16))
except: return '?'
# chr() is needed because re.sub expects a string back;
# int(s, base) was not available in 1.5.2
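
For comparison, here is a minimal sketch of the same eval-free idea on a current interpreter. It assumes Python 3 and a single-byte charset such as ISO-8859-1; the function name decode_q_escapes and its charset parameter are invented for this example. It reuses the thread's "=([^=].)" pattern, so anything that is not valid hex falls through to "?".

```python
import re

def decode_q_escapes(text, charset="iso-8859-1"):
    """Replace =XX escapes without eval (hypothetical helper).

    Assumes a single-byte charset: each escape is decoded on its own,
    so multi-byte encodings such as UTF-8 would come out as "?".
    """
    def hex_to_char(match):
        try:
            # int(..., 16) only parses a number; nothing gets executed.
            return bytes([int(match.group(1), 16)]).decode(charset)
        except (ValueError, UnicodeDecodeError):
            return "?"

    # Underscores stand for spaces in Q-encoding; convert them first so
    # that an escaped underscore (=5F) survives as a literal "_".
    text = text.replace("_", " ")
    return re.sub(r"=([^=].)", hex_to_char, text)

print(decode_q_escapes("Marcos_Mendon=E7a"))
```
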
--
TZOTZIOY, I speak England very best,
"I have a cunning plan, m'lord" --Sean Bean as Odysseus/Ulysses
Jul 18 '05 #6
Paul Rubin wrote:
> A lot of those funny subjects come from spammers. Never eval
> anything from anyone like that!!!

Hi Paul, yeah, actually, that kind of 'funky' subject is very common
on mailing-lists here in Brazil (where ISO-8859-1 is the standard). A
lot of people use crappy webmail software which spills out that kind
of mess. So I'm forced to deal with it :-)

By the way, this is for a mail2rss application which will enable easy
removal/blacklisting of spam, among other things.

Christos TZOTZIOY Georgiou wrote:
> A sound advice by Paul. However, lots of those funny subjects come
> in legitimate e-mails from countries where the ascii range is not
> enough. So, a safer alternative to the code above is:
>
> try: return chr(string.atoi(str, 16))
> except: return '?'
> # int(s, base) was not available in 1.5.2


Thanks! Yeah, I tried using int(str, base) on Python 1.5.2, and I was
too lazy to look for the alternative when I was able to do that quick
and dirty eval() thingy :-)

\\ jonas galvez
// jonasgalvez.com

Jul 18 '05 #7
Hi!

You need to decode each portion of the subject delimited by
=? ... ?= and then join the results.
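
One way to read that advice, sketched under the assumption of Python 3: find every =?charset?encoding?payload?= block with a regular expression, decode each block on its own, and leave the surrounding text in place. The names ENCODED_WORD and decode_encoded_words are made up for this illustration, and the sketch does not drop the whitespace between adjacent encoded blocks the way a full RFC 2047 decoder would.

```python
import base64
import binascii
import re

# One block: =?charset?Q-or-B?payload?=
ENCODED_WORD = re.compile(r"=\?([^?]+)\?([QqBb])\?([^?]*)\?=")

def decode_encoded_words(header):
    """Decode each =?...?= block separately (hypothetical helper)."""
    def decode_block(match):
        charset, encoding, payload = match.groups()
        try:
            data = payload.encode("ascii")
            if encoding.upper() == "Q":
                # header=True makes "_" decode to a space.
                raw = binascii.a2b_qp(data, header=True)
            else:
                raw = base64.b64decode(data)
            return raw.decode(charset, errors="replace")
        except (LookupError, ValueError):
            # Unknown charset or malformed payload: keep the block as-is.
            return match.group(0)

    # sub() splices the decoded blocks back between the untouched text,
    # which amounts to decoding each block and joining the results.
    return ENCODED_WORD.sub(decode_block, header)

print(decode_encoded_words("=?ISO-8859-1?Q?Murilo_Corr=EAa?="))
```
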

*sorry for my poor english*

@-salutations
--
Michel Claveau
mél : http://cerbermail.com/?6J1TthIa8B


Jul 18 '05 #8
