mbox despamming script

I was surprised there was no obvious way with spamassassin (maybe I
shoulda looked at spambayes) to split an existing mbox file into its
spam and non-spam messages. So I wrote one. It's pretty slow, taking
around 1.5 seconds per message on a 2 GHz Athlon, making me wonder how
serious ISPs getting thousands of incoming messages per hour can run
anything like spamassassin on all of them. But for my purposes it's ok.
Comments and improvements are welcome.

================================================================

#!/usr/bin/python

# Spam filter for mbox files. Reads mailfile and makes two new
# files, mailfile.spam and mailfile.ham, containing the spam and non-spam
# messages from mailfile as determined by piping through spamc.

# Copyright 2003 Paul Rubin <http://www.paulrubin.com>
# Copying permission: GNU General Public License ver. 2, <http://www.gnu.org>

import mailbox,os,sys
from time import time

def mktemp():
    # Build a hard-to-guess temp file name from the pid and current time.
    import sha,os,time
    d = sha.new("spam:%s,%s"%(os.getpid(),time.time())).hexdigest()
    return "spam%s.temp"% d[:10]

tempfilename = mktemp()

def main():
    print sys.argv
    if len(sys.argv) > 1:
        filename = sys.argv[1]
    else:
        print "Usage: spam.py mboxfile"
        sys.exit(1)     # no mbox file given, nothing to do

    print "marking up", filename
    mailfile = open(filename, 'r')
    ham = open(filename + ".ham", 'w')
    spam = open(filename + ".spam", 'w')

    mbox = mailbox.UnixMailbox(mailfile)
    i = 0

    while 1:
        i += 1
        m1 = mailfile.tell()    # offset of this message's "From " line
        msg = mbox.next()
        if not msg: break
        body = msg.fp.read()
        envelope = env_header(mailfile, m1)
        print "%5d"%i, m1, mailfile.tell(), msg.startofbody, len(body),
        is_spam, txt = spam_filter(envelope, msg, body)
        print ['HAM','SPAM'][is_spam]

        if is_spam:
            spam.write(txt)
        else:
            ham.write(txt)

    ham.close()
    spam.close()

def spam_filter(envelope, msg, body):
    # Reassemble the message and pipe it through spamc into the temp file.
    txt = envelope + ''.join(msg.headers) + '\n' + body
    out = os.popen("spamc > %s"% tempfilename, "w")
    out.write(txt)
    out.close()

    # X-Spam-Level holds one character per point of spam score.
    t = mailbox.UnixMailbox(open(tempfilename))
    spam_level = len(t.next().get('X-Spam-Level', ''))
    txt = open(tempfilename).read()
    return (spam_level >= 5, txt)

def env_header(fp, pos):
    # Read the envelope ("From ") line at pos, then restore the file position.
    t = fp.tell()
    fp.seek(pos)
    e = fp.readline()
    fp.seek(t)
    return e

try:
    t=time()
    main()
    dt = time()-t
    print "elapsed: %d min %d sec"% divmod(int(dt), 60)
finally:
    # The temp file only exists once spamc has run at least once.
    if os.path.exists(tempfilename):
        os.unlink(tempfilename)
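
For readers on a current interpreter: the script above is Python 2 and relies on the `sha` module and `mailbox.UnixMailbox`, neither of which exists in Python 3. What follows is a rough Python 3 sketch of the same idea, not the author's code. The names `split_mbox`, `spam_level`, and `spamc_classify` are made up for illustration, and the classifier is passed in as a callable so the splitting logic can be tried without spamc installed. It assumes, as the original does, that `X-Spam-Level` carries one character per point and that 5 is the cutoff. Unlike the original, it appends to existing `.ham` and `.spam` files rather than truncating them.

```python
import mailbox
import subprocess


def spam_level(msg):
    """Length of X-Spam-Level, mirroring the original script's len() check."""
    return len(msg.get("X-Spam-Level", ""))


def spamc_classify(msg):
    """Pipe one message through spamc (assumed to be on PATH) and parse the result."""
    out = subprocess.run(
        ["spamc"], input=msg.as_bytes(), stdout=subprocess.PIPE, check=True
    ).stdout
    filtered = mailbox.mboxMessage(out)
    filtered.set_from(msg.get_from())  # keep the original envelope line
    return filtered


def split_mbox(path, classify, threshold=5):
    """Copy messages from path into path.ham and path.spam.

    classify is a hypothetical callable: it takes a mailbox.mboxMessage and
    returns the message with X-Spam-Level set. Returns (ham_count, spam_count).
    """
    src = mailbox.mbox(path, create=False)
    ham = mailbox.mbox(path + ".ham")
    spam = mailbox.mbox(path + ".spam")
    counts = [0, 0]  # index 0 = ham, index 1 = spam
    try:
        for msg in src:
            filtered = classify(msg)
            is_spam = spam_level(filtered) >= threshold
            (spam if is_spam else ham).add(filtered)
            counts[is_spam] += 1
    finally:
        # close() flushes pending writes for each mailbox
        ham.close()
        spam.close()
        src.close()
    return tuple(counts)
```

With spamc available, `split_mbox("mailfile", spamc_classify)` would do roughly what the script does. This still starts one spamc process per message, so it is not expected to be any faster.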
Jul 18 '05 #1
Paul Rubin <http://ph****@NOSPAM.invalid> writes:
> I was surprised there was no obvious way with spamassassin (maybe I
> shoulda looked at spambayes) to split an existing mbox file into its
> spam and non-spam messages. So I wrote one. It's pretty slow, taking
> around 1.5 seconds per message on a 2 GHz Athlon, making me wonder how
> serious ISPs getting thousands of incoming messages per hour can run
> anything like spamassassin on all of them. But for my purposes it's ok.
> Comments and improvements are welcome.


It's my experience that mailbox is pretty slow at reading mbox files.
I have memories of speeding up some mail-statistics gathering stuff by
a large amount by implementing my own mbox "parser" (basically
s.find('\n\nFrom ') or similar, I forget). I'm not sure I'd like to
use this approach on something less forgiving than stats, though :-)
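
A minimal Python 3 sketch of the kind of hand-rolled splitter described above, assuming the whole mbox fits in memory as a string. The function name `split_mbox_text` is invented here, and this is a guess at the approach, not the code mwh actually used. It splits wherever a blank line is followed by `From `, so a body line that begins with an unescaped `From ` after a blank line would be mistaken for a new message, which is presumably the "less forgiving" risk mentioned.

```python
def split_mbox_text(data):
    """Split raw mbox text into messages at blank-line + 'From ' boundaries."""
    messages = []
    start = 0
    while True:
        end = data.find("\n\nFrom ", start)
        if end == -1:
            # Last message (or empty input): take whatever is left.
            if data[start:].strip():
                messages.append(data[start:])
            break
        messages.append(data[start:end + 1])  # keep one trailing newline
        start = end + 2  # skip the blank line; next message begins at 'From '
    return messages
```

Each returned string starts with its envelope `From ` line, so the pieces can be passed to a spam filter or written back out unchanged.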

Cheers,
mwh

--
59. In English every word can be verbed. Would that it were so in
our programming languages.
-- Alan Perlis, http://www.cs.yale.edu/homes/perlis-alan/quotes.html
Jul 18 '05 #2
