Bytes IT Community

Standard module for parsing emails?

Is there a standard library for parsing emails that can cope with the
different ways email clients quote?
Jul 30 '08 #1
13 Replies


Phillip B Oldham wrote:
> Is there a standard library for parsing emails that can cope with the
> different ways email clients quote?
AFAIK not - as unfortunately that's something the user can configure, and
thus no atrocity is unimaginable. Hard to write a module for that...

All you can do is apply a heuristic like "if a run of lines all start with
the same prefix, and that prefix contains non-alphanumeric characters, treat
those lines as quoted". But if the user configures the client to quote using

XX

you're doomed...

Diez
Jul 30 '08 #2
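
The heuristic Diez describes could be sketched roughly as below. This is only an illustration under stated assumptions: the helper name guess_quote_prefix, the three-character prefix limit and the min_lines threshold are all made up for the example, not taken from any library.

```python
import re
from collections import Counter

# Assumption: a candidate quote prefix is a short run (1 to 3 characters) of
# characters that are neither alphanumeric nor whitespace, at the very start
# of a line, optionally followed by a single space.
_PREFIX_RE = re.compile(r"^([^\w\s]{1,3}) ?")


def guess_quote_prefix(body, min_lines=2):
    """Return the most common candidate prefix, or None if none repeats.

    Hypothetical helper: it only counts how often each leading
    non-alphanumeric prefix occurs in the message body.
    """
    counts = Counter()
    for line in body.splitlines():
        match = _PREFIX_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    if not counts:
        return None
    prefix, seen = counts.most_common(1)[0]
    return prefix if seen >= min_lines else None
```

As Diez points out, a purely alphanumeric prefix such as "XX" never matches the pattern, so the helper returns None for a message quoted that way.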

Phillip B Oldham <ph************@gmail.com> writes:
> Is there a standard library for parsing emails that can cope with
> the different ways email clients quote?
"Cope with" in what sense? i.e., what would the behaviour of such a
library be? What would it do?

Note also that it's not merely the mail client that does the quoting;
frequently the user composing the message will have a heavy hand in
how the quoted material appears.

--
\ “Time flies like an arrow. Fruit flies like a banana.” —Groucho |
`\ Marx |
_o__) |
Ben Finney
Jul 30 '08 #3

Phillip B Oldham wrote:
> Is there a standard library for parsing emails that can cope with the
> different ways email clients quote?
What do you mean by "quote" here?
1. Encoding utf8/latin1 as ascii
2. The prefix of quoted text, like your text above in my mail

Thomas
--
Thomas Guettler, http://www.thomas-guettler.de/
E-Mail: guettli (*) thomas-guettler + de
Jul 30 '08 #4

On Jul 30, 2:36 pm, Thomas Guettler <h...@tbz-pariv.de> wrote:
> What do you mean by "quote" here?
> 2. The prefix of quoted text, like your text above in my mail
Basically, just be able to parse an email into its actual and "quoted"
parts - lines which have been prefixed to indent from a previous
email.

Most clients use ">" which is easy to check for, but I've seen some
which use "|" and some which *don't* quote at all. It's causing us
nightmares in parsing responses to system-generated emails. I was
hoping someone might've seen the problem previously and released some
code.
Jul 30 '08 #5
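
For the two markers mentioned above, the split into "actual" and "quoted" parts can be sketched in a few lines. The name split_reply and the QUOTE_PREFIXES tuple are assumptions made for this example, and a message whose sender did not quote at all will come back as entirely new text.

```python
# Assumption: only the two quote markers named in the post are recognised.
QUOTE_PREFIXES = (">", "|")


def split_reply(body, prefixes=QUOTE_PREFIXES):
    """Return (new_lines, quoted_lines) for a plain-text message body.

    Hypothetical helper: a line counts as quoted when, after leading
    whitespace is removed, it starts with one of the given prefixes.
    """
    new_lines, quoted_lines = [], []
    for line in body.splitlines():
        # str.startswith accepts a tuple of alternatives
        if line.lstrip().startswith(prefixes):
            quoted_lines.append(line)
        else:
            new_lines.append(line)
    return new_lines, quoted_lines
```

Calling split_reply(text) on a reply body gives two lists that can be joined back with "\n".join(...) if a single string is needed for each part.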

If there isn't a standard library for parsing emails, is there one for
connecting to a pop/imap resource and reading the mailbox?
Jul 30 '08 #6

On Wednesday 30 July 2008 17:15:07, Phillip B Oldham wrote:
> If there isn't a standard library for parsing emails, is there one for
> connecting to a pop/imap resource and reading the mailbox?
Both ship with Python: the email module and poplib. Both are very well
documented in the official docs (with examples and all).

The email module is rather easy to use and really powerful, but you'll need to
handle for yourself the many ways email clients compose a message, and broken
PHP webmails that don't respect the RFCs (notably about encoding)...

--
_____________

Maric Michaud
Jul 30 '08 #7
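
A rough sketch of how the two modules fit together, assuming a current Python 3 (the policy-based email API used here is newer than this thread). The function names fetch_raw_messages and plain_text_body are invented for the example, and the host and credentials are whatever your own mailbox needs.

```python
import poplib
from email import message_from_bytes, policy


def fetch_raw_messages(host, user, password):
    """Download every message in a POP3 mailbox as raw bytes.

    Hypothetical helper: assumes the server speaks POP3 over SSL on the
    default port. Messages are only retrieved, not deleted.
    """
    conn = poplib.POP3_SSL(host)
    try:
        conn.user(user)
        conn.pass_(password)
        count, _size = conn.stat()
        # retr() returns (response, list_of_line_bytes, octets)
        return [b"\r\n".join(conn.retr(i)[1]) for i in range(1, count + 1)]
    finally:
        conn.quit()


def plain_text_body(raw_bytes):
    """Return the text/plain body of a raw message, or "" if there is none."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    part = msg.get_body(preferencelist=("plain",))
    return part.get_content() if part is not None else ""
```

The email module handles headers, MIME structure and character sets; what it hands back is just the body text, so deciding which lines of that text are quoted is still up to you.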

Phillip B Oldham wrote:
> If there isn't a standard library for parsing emails, is there one for
> connecting to a pop/imap resource and reading the mailbox?

The search [1] yielded these results:
1) http://docs.python.org/lib/module-email.html
2) http://www.devshed.com/c/a/Python/Py...Email-Parsing/

I have used the email module very successfully.

Also you can try the following to connect to mailboxes:
1) poplib
2) imaplib

For parsing the mails I would recommend pyparsing.
[1] http://www.google.com/search?client=...utf-8&oe=utf-8

Regards

Nicolaas

--

The three things to remember about Llamas:
1) They are harmless
2) They are deadly
3) They are made of lava, and thus nice to cuddle.
Jul 30 '08 #8

On Wednesday 30 July 2008 17:55:35, Aspersieman wrote:
> For parsing the mails I would recommend pyparsing.
Why? The email module is a great parser IMO.

--
_____________

Maric Michaud
Jul 30 '08 #9

On Jul 30, 3:11 pm, Phillip B Oldham <phillip.old...@gmail.com> wrote:
> On Jul 30, 2:36 pm, Thomas Guettler <h...@tbz-pariv.de> wrote:
> > What do you mean by "quote" here?
> > 2. The prefix of quoted text, like your text above in my mail
>
> Basically, just be able to parse an email into its actual and "quoted"
> parts - lines which have been prefixed to indent from a previous
> email.
>
> Most clients use ">" which is easy to check for, but I've seen some
> which use "|" and some which *don't* quote at all. It's causing us
> nightmares in parsing responses to system-generated emails. I was
> hoping someone might've seen the problem previously and released some
> code.
The problem is that sometimes lines might start with ">" for other
reasons, eg text copied from an interactive Python session, which
could occur in ... um ... _this_ newsgroup. :-)
Jul 30 '08 #10
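
One way to soften that particular false positive is to special-case the interpreter prompt. The sketch below assumes a prompt is exactly ">>>" followed by a space or the end of the line; is_quoted is a made-up name, and the rule will misclassify a third-level quote that some client writes as ">>> " without spaces between the markers.

```python
import re

# Assumption: an interactive-session line starts with exactly ">>>" followed
# by a space or by the end of the line.
_PROMPT_RE = re.compile(r"^>>>( |$)")


def is_quoted(line):
    """Guess whether a line is quoted text rather than a Python prompt."""
    stripped = line.lstrip()
    if _PROMPT_RE.match(stripped):
        return False  # looks like a pasted interpreter session
    return stripped.startswith(">")
```

A quoted interpreter session such as "> >>> print(1)" still counts as quoted, because the line begins with a single ">" before the prompt.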

Maric Michaud wrote:
> On Wednesday 30 July 2008 17:55:35, Aspersieman wrote:
> > For parsing the mails I would recommend pyparsing.
>
> Why? The email module is a great parser IMO.
He's talking about parsing the *content*, not the email envelope and possible
MIME body.

Diez
Jul 30 '08 #11

On Wednesday 30 July 2008 19:25:31, Diez B. Roggisch wrote:
> Maric Michaud wrote:
> > On Wednesday 30 July 2008 17:55:35, Aspersieman wrote:
> > > For parsing the mails I would recommend pyparsing.
> > Why? The email module is a great parser IMO.
>
> He's talking about parsing the *content*, not the email envelope and possible
> MIME body.
Yes? I don't know what the OP wants to do with the content, but if it's just
filtering the lines beginning with a '>', pyparsing might be a bit
heavyweight.

--
_____________

Maric Michaud

Jul 30 '08 #12

On Wed, 30 Jul 2008 07:11:45 -0700, Phillip B Oldham wrote:
> Most clients use ">" which is easy to check for, but I've seen some
> which use "|" and some which *don't* quote at all. It's causing us
> nightmares in parsing responses to system-generated emails. I was hoping
> someone might've seen the problem previously and released some code.
My sympathies.

I've even seen clients that prefix new (unquoted) text with the quote
character ">".

Well, possibly it's not the mail client, but the user. Who knows?

I will sometimes quote text like this:

[quote]
Something quoted.
[end quote]

But I'm writing for a human audience, not for a program.

The simple answer is that you can catch 90% of cases by checking for ">",
and another 1% by checking for "|". If the email contains HTML, I have
found that quoted text is sometimes in another colour. As for the rest,
well, sometimes even human beings can't easily determine what's quoted
and what isn't. Good luck getting a program to do it.

(Percentages are plucked out of thin air. YMMV.)
--
Steven
Jul 31 '08 #13

On Thu, 31 Jul 2008 02:25:37 +0000, Steven D'Aprano wrote:
> On Wed, 30 Jul 2008 07:11:45 -0700, Phillip B Oldham wrote:
> > Most clients use ">" which is easy to check for, but I've seen some
> > which use "|" and some which *don't* quote at all. It's causing us
> > nightmares in parsing responses to system-generated emails. I was
> > hoping someone might've seen the problem previously and released some
> > code.
>
> My sympathies.
>
> I've even seen clients that prefix new (unquoted) text with the quote
> character ">".

Well, this is a new one I've never seen before: found on the python-dev
mailing list, somebody who (apparently) marks quoted text by inserting a
bare quote character on an otherwise empty line after each line of text,
similar to this:

I've even seen clients that prefix new (unquoted) text with the quote
>
character ">".
>
The user in question seems to be using gmail. I suspect a PEBCAK error.

--
Steven
Jul 31 '08 #14
