I was surprised there was no obvious way with spamassassin (maybe I
shoulda looked at spambayes) to split an existing mbox file into its
spam and non-spam messages. So I wrote one. It's pretty slow, taking
around 1.5 seconds per message on a 2 GHz Athlon, making me wonder how
serious ISPs getting thousands of incoming messages per hour can run
anything like spamassassin on all of them. But for my purposes it's ok.
Comments and improvements are welcome.
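For scale, the figure above can be turned into an hourly rate. This is only back-of-the-envelope arithmetic: the 1.5 seconds comes from the post, while the function name and the `workers` parameter are hypothetical and assume perfectly parallel scanning.

```python
SECONDS_PER_MESSAGE = 1.5  # figure quoted in the post


def messages_per_hour(seconds_per_message, workers=1):
    """Messages scanned per hour, assuming each worker runs serially."""
    return workers * 3600 / seconds_per_message


print(messages_per_hour(SECONDS_PER_MESSAGE))  # 2400.0
```

So a single serial process at this speed would handle about 2400 messages per hour.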
================================================================
#!/usr/bin/python

# Spam filter for mbox files.  Reads mailfile and makes two new
# files, mailfile.spam and mailfile.ham, containing the spam and non-spam
# messages from mailfile as determined by piping through spamc.

# Copyright 2003 Paul Rubin <http://www.paulrubin.com>
# Copying permission: GNU General Public License ver. 2, <http://www.gnu.org>

import mailbox,os,sys
from time import time

def mktemp():
    # Build a scratch file name from the pid and the current time.
    import sha,os,time
    d = sha.new("spam:%s,%s"%(os.getpid(),time.time())).hexdigest()
    return "spam%s.temp"% d[:10]

tempfilename = mktemp()

def main():
    print sys.argv
    if len(sys.argv) > 1:
        filename = sys.argv[1]
    else:
        print "Usage: spam.py mboxfile"
        sys.exit(1)     # without this, filename is undefined below

    print "marking up", filename
    mailfile = open(filename, 'r')
    ham = open(filename + ".ham", 'w')
    spam = open(filename + ".spam", 'w')
    mbox = mailbox.UnixMailbox(mailfile)

    i = 0
    while 1:
        i += 1
        m1 = mailfile.tell()    # offset of this message's "From " line
        msg = mbox.next()
        if msg is None: break
        body = msg.fp.read()
        envelope = env_header(mailfile, m1)
        print "%5d"%i, m1, mailfile.tell(), msg.startofbody, len(body),
        is_spam, txt = spam_filter (envelope, msg, body)
        print ['HAM','SPAM'][is_spam]
        if is_spam:
            spam.write(txt)
        else:
            ham.write(txt)

def spam_filter(envelope, msg, body):
    # Pipe the whole message through spamc, then read back the annotated
    # copy and count the stars in its X-Spam-Level header.
    txt = envelope + ''.join(msg.headers) + '\n' + body
    out = os.popen("spamc > %s"% tempfilename, "w")
    out.write(txt)
    out.close()
    t = mailbox.UnixMailbox(open(tempfilename))
    spam_level = len(t.next().get('X-Spam-Level', ''))
    txt = open(tempfilename).read()
    return (spam_level >= 5, txt)

def env_header(fp, pos):
    # Return the envelope ("From ") line at pos without moving fp.
    t = fp.tell()
    fp.seek(pos)
    e = fp.readline()
    fp.seek(t)
    return e

try:
    t=time()
    main()
    dt = time()-t
    print "elapsed: %d min %d sec"% divmod(int(dt), 60)
finally:
    # The temp file only exists once spamc has run at least once.
    if os.path.exists(tempfilename):
        os.unlink(tempfilename)
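
For readers on Python 3, where `mailbox.UnixMailbox` and the `sha` module no longer exist, the same split might be sketched roughly as below. This is not the script above, only an approximation of it: the function names and the `classify` and `threshold` parameters are made up for illustration, `spamc` is assumed to be on the PATH and to add an `X-Spam-Level` header as in the original, and the "From " envelope line is regenerated by `mailbox` rather than copied through.

```python
import email
import mailbox
import subprocess


def spam_level(raw):
    """Count the stars in the X-Spam-Level header of a filtered message."""
    msg = email.message_from_bytes(raw)
    return len(str(msg.get("X-Spam-Level", "")).strip())


def run_spamc(raw):
    """Pipe one raw message through spamc; return the annotated copy."""
    done = subprocess.run(["spamc"], input=raw,
                          stdout=subprocess.PIPE, check=True)
    return done.stdout


def split_mbox(path, classify=run_spamc, threshold=5):
    """Write path.ham and path.spam; return (ham_count, spam_count)."""
    src = mailbox.mbox(path, create=False)
    ham = mailbox.mbox(path + ".ham")
    spam = mailbox.mbox(path + ".spam")
    counts = [0, 0]
    try:
        for key in src.iterkeys():
            filtered = classify(src.get_bytes(key))
            is_spam = spam_level(filtered) >= threshold
            (spam if is_spam else ham).add(filtered)
            counts[is_spam] += 1
    finally:
        for box in (src, ham, spam):
            box.close()
    return tuple(counts)
```

Passing the message on stdin and reading stdout avoids the temporary file altogether, and making `classify` a parameter lets the splitting logic be tried without a running spamd.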
Paul Rubin <http://ph****@NOSPAM.invalid> writes:

> I was surprised there was no obvious way with spamassassin (maybe I
> shoulda looked at spambayes) to split an existing mbox file into its
> spam and non-spam messages. So I wrote one. It's pretty slow, taking
> around 1.5 seconds per message on a 2 GHz Athlon, making me wonder how
> serious ISPs getting thousands of incoming messages per hour can run
> anything like spamassassin on all of them. But for my purposes it's ok.
> Comments and improvements are welcome.

It's my experience that mailbox is pretty slow at reading mbox files.
I have memories of speeding up some mail-statistics gathering stuff by
a large amount by implementing my own mbox "parser" (basically
s.find('\n\nFrom ') or similar, I forget). I'm not sure I'd like to
use this approach on something less forgiving than stats, though :-)
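
A minimal sketch of that kind of hand-rolled splitter, written for Python 3, might look like this. It is a guess at the idea rather than the code referred to above, and `split_mbox_text` is a made-up name. It assumes the whole mbox fits in memory and that body lines starting with "From " have been escaped (for example as ">From ") by whatever wrote the file; an unescaped one after a blank line would be treated as a message boundary.

```python
def split_mbox_text(text):
    """Split mbox text into messages at blank-line + 'From ' boundaries."""
    messages = []
    start = 0
    while True:
        cut = text.find("\n\nFrom ", start)
        if cut == -1:
            break
        # Keep the newline that ends the last body line of this message.
        messages.append(text[start:cut + 1])
        # Skip the blank line; the next message starts at its "From " line.
        start = cut + 2
    if text[start:].strip():
        messages.append(text[start:])
    return messages
```

Each returned string starts with its "From " envelope line, so the pieces can be written back out or handed to a filter unchanged.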
Cheers,
mwh
--
59. In English every word can be verbed. Would that it were so in
our programming languages.
-- Alan Perlis, http://www.cs.yale.edu/homes/perlis-alan/quotes.html