mbox despamming script

I was surprised there was no obvious way with SpamAssassin (maybe I
shoulda looked at SpamBayes) to split an existing mbox file into its
spam and non-spam messages. So I wrote a script to do it. It's pretty
slow, taking around 1.5 seconds per message on a 2 GHz Athlon, making me
wonder how serious ISPs getting thousands of incoming messages per hour
can run anything like SpamAssassin on all of them. But for my purposes
it's OK. Comments and improvements are welcome.

================================================================

#!/usr/bin/python

# Spam filter for mbox files. Reads mailfile and makes two new
# files, mailfile.spam and mailfile.ham, containing the spam and non-spam
# messages from mailfile as determined by piping through spamc.

# Copyright 2003 Paul Rubin <http://www.paulrubin.com>
# Copying permission: GNU General Public License ver. 2, <http://www.gnu.org>

import mailbox,os,sys
from time import time

def mktemp():
    # Build a temp file name from a hash of the pid and the current time.
    import sha,os,time
    d = sha.new("spam:%s,%s"%(os.getpid(),time.time())).hexdigest()
    return "spam%s.temp"% d[:10]

tempfilename = mktemp()

def main():
    print sys.argv
    if len(sys.argv) > 1:
        filename = sys.argv[1]
    else:
        print "Usage: spam.py mboxfile"
        sys.exit(1)     # nothing to do without a mailbox file

    print "marking up", filename
    mailfile = open(filename, 'r')
    ham = open(filename + ".ham", 'w')
    spam = open(filename + ".spam", 'w')

    mbox = mailbox.UnixMailbox(mailfile)
    i = 0

    while 1:
        i += 1
        # offset where this message's envelope ("From ") line should start
        m1 = mailfile.tell()
        msg = mbox.next()
        if not msg: break
        body = msg.fp.read()
        envelope = env_header(mailfile, m1)
        print "%5d"%i, m1, mailfile.tell(), msg.startofbody, len(body),
        is_spam, txt = spam_filter (envelope, msg, body)
        print ['HAM','SPAM'][is_spam]

        if is_spam:
            spam.write(txt)
        else:
            ham.write(txt)

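def spam_filter(envelope, msg, body):
    # Reassemble the message and pipe it through spamc into the temp file.
    txt = envelope + ''.join(msg.headers) + '\n' + body
    out = os.popen("spamc > %s"% tempfilename, "w")
    out.write(txt)
    out.close()

    # X-Spam-Level is a run of asterisks, so its length is the spam
    # level; 5 or more counts as spam.
    t = mailbox.UnixMailbox(open(tempfilename))
    spam_level = len(t.next().get('X-Spam-Level', ''))
    txt = open(tempfilename).read()
    return (spam_level >= 5, txt)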

def env_header(fp, pos):
    # Read the line at offset pos, then restore fp's position.
    t = fp.tell()
    fp.seek(pos)
    e = fp.readline()
    fp.seek(t)
    return e

try:
    t=time()
    main()
    dt = time()-t
    print "elapsed: %d min %d sec"% divmod(int(dt), 60)
finally:
    # The temp file only exists once spamc has run at least once.
    if os.path.exists(tempfilename):
        os.unlink(tempfilename)
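
For readers on a current interpreter, here is a rough Python 3 sketch of the same idea. It is not the script above: it assumes `mailbox.mbox` in place of `UnixMailbox`, and `subprocess` in place of `os.popen` plus a temp file. The names `split_mbox` and `spamc_is_spam` are made up for illustration, and the default classifier assumes a `spamc` binary on the PATH that can reach a running spamd. It also writes each original message to the output files, whereas the script above writes spamc's marked-up copy.

```python
import mailbox
import subprocess
from email import message_from_bytes


def spamc_is_spam(raw_message, threshold=5):
    """Return True if spamc marks the message with >= threshold stars.

    Assumes a spamc binary on PATH that can reach a running spamd.
    """
    marked = subprocess.run(["spamc"], input=raw_message,
                            stdout=subprocess.PIPE, check=True).stdout
    # X-Spam-Level is a run of asterisks, so its length is the level.
    level = message_from_bytes(marked).get("X-Spam-Level", "")
    return len(str(level).strip()) >= threshold


def split_mbox(path, is_spam=spamc_is_spam):
    """Copy each message in the mbox at path to path.spam or path.ham."""
    source = mailbox.mbox(path, create=False)
    outputs = {"ham": mailbox.mbox(path + ".ham"),
               "spam": mailbox.mbox(path + ".spam")}
    counts = {"ham": 0, "spam": 0}
    try:
        for key in source.keys():
            # from_=True keeps the "From " envelope line with the message.
            raw = source.get_bytes(key, from_=True)
            kind = "spam" if is_spam(raw) else "ham"
            outputs[kind].add(raw)
            counts[kind] += 1
    finally:
        for box in outputs.values():
            box.close()
        source.close()
    return counts
```

Passing a different `is_spam` callable (any function taking the raw message bytes and returning a bool) lets the splitting logic be tried without spamd running. Note that this sketch appends to existing `.ham` and `.spam` files rather than overwriting them.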
Jul 18 '05 #1
Paul Rubin <http://ph****@NOSPAM.invalid> writes:
> I was surprised there was no obvious way with SpamAssassin (maybe I
> shoulda looked at SpamBayes) to split an existing mbox file into its
> spam and non-spam messages. So I wrote a script to do it. It's pretty
> slow, taking around 1.5 seconds per message on a 2 GHz Athlon, making me
> wonder how serious ISPs getting thousands of incoming messages per hour
> can run anything like SpamAssassin on all of them. But for my purposes
> it's OK. Comments and improvements are welcome.


It's my experience that mailbox is pretty slow at reading mbox files.
I have memories of speeding up some mail-statistics gathering stuff by
a large amount by implementing my own mbox "parser" (basically
s.find('\n\nFrom ') or similar, I forget). I'm not sure I'd like to
use this approach on something less forgiving than stats, though :-)
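
A hand-rolled splitter of that kind might look roughly like the sketch below. This is a guess at the approach, not the code described above (which isn't shown), and `split_mbox_text` is a made-up name. It uses Python 3 bytes and treats every blank line followed by "From " as a message boundary, so a body paragraph that begins with an unescaped "From " would be split wrongly, which is the sort of thing that makes it unsuitable for anything less forgiving than stats.

```python
def split_mbox_text(data):
    """Split raw mbox bytes at each blank line followed by 'From '."""
    separator = b"\n\nFrom "
    messages = []
    start = 0
    while True:
        cut = data.find(separator, start)
        if cut == -1:
            # No more boundaries: whatever is left is the last message.
            tail = data[start:]
            if tail.strip():
                messages.append(tail)
            return messages
        # Keep one newline to end this message; drop the blank line.
        messages.append(data[start:cut + 1])
        # The next message begins at the "F" of the following "From ".
        start = cut + 2
```

Each returned item starts with its own "From " envelope line, so it can be written back out or handed to a mail parser one message at a time.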

Cheers,
mwh

--
59. In English every word can be verbed. Would that it were so in
our programming languages.
-- Alan Perlis, http://www.cs.yale.edu/homes/perlis-alan/quotes.html
Jul 18 '05 #2
