Tuesday, December 18, 2007

Making Archivemail work with DSpam

I've got a dspam "appliance" that all the enterprise email filters through. I've set it up so that a single dspam user filters all the emails. This has worked well over the past few years, but managing it has been quite a chore. Every morning, I'd have to wade through the emails in the quarantine (about 15K) and free up any False Positives that were caught.

Anything above 58% spam confidence, as reported by DSpam, is pretty much always spam. Below that, between 47% and 57%, there may be one or two False Positives.

After freeing them up, deleting the remaining emails is a huge chore, because the DSpam UI will not allow deleting the quarantine file when new spam pops in.

So I needed a little program which would scan the quarantine mbox file and delete any messages with a spam confidence of 58% or higher.

I tried the most obvious program, 'archivemail', which was readily available in all distros, but was disappointed that it only allowed filtering on a message's age. There was a mysterious "Filter" switch, but it only applied to IMAP mailboxes.

The great thing is that archivemail, like the entire email stack on my servers, is completely Free Software. I just had to invest some time to look at the code. archivemail lived in /usr/bin/. I had a look at the file, and it's a very small 1500-line python script!

I haven't programmed in python before, but looking at the code, it didn't look too scary. It had classes, but no colons. Indentation seemed to be important here. I scanned the code, and I found the little function called "should_archive(message)". And sure enough, the crux of the logic which defines whether a message is to be archived away or not, was there.

So I added this line:
    if (options.spam_confidence > 0
            and options.spam_confidence > get_spam_confidence(message)):
        return 0
I also modified the options class to include a spam_confidence field, added some code to read it in from the command line options, and then wrote the function which extracts the spam confidence from the message headers. This was relatively easy, because the rest of the code basically does the same thing: reading values off the headers and acting on them. My new function looked like this:

def get_spam_confidence(message):
    """Returns the DSPAM_Confidence from the message headers. Zero by default"""
    # 071218 yky Created

    assert message is not None

    # Use the first header that carries a non-zero confidence value
    for header in ('X-DSPAM-Confidence', 'SPAM-Confidence'):
        confidence = message.get(header)
        if confidence:
            confidence_val = float(confidence)
            if confidence_val:
                vprint("Spam Confidence: %f " % confidence_val)
                return confidence_val

    return 0.0
That's it!
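
For anyone without a patched archivemail, the same idea can be roughed out as a standalone script with Python's standard mailbox module. This is only a sketch and not the patch itself: the names spam_confidence and purge_spam are made up for the example, and it assumes the header holds a plain number on the same scale as the threshold you pass in.

```python
import mailbox

# Header names taken from the function above
HEADERS = ("X-DSPAM-Confidence", "SPAM-Confidence")

def spam_confidence(message):
    """Return the first parseable confidence header as a float, else 0.0."""
    for header in HEADERS:
        value = message.get(header)
        if value is None:
            continue
        try:
            return float(value)
        except (TypeError, ValueError):
            continue  # malformed header: try the next one
    return 0.0

def purge_spam(mbox_path, threshold):
    """Delete messages with confidence >= threshold; return how many went."""
    box = mailbox.mbox(mbox_path)
    try:
        box.lock()  # keep other writers away while we rewrite the file
        doomed = [key for key, msg in box.iteritems()
                  if spam_confidence(msg) >= threshold]
        for key in doomed:
            box.remove(key)
        box.flush()  # write the changes back to disk
    finally:
        box.close()  # also releases the lock if it is held
    return len(doomed)
```

Unlike get_spam_confidence above, this version skips a malformed header value instead of raising an error. Try it on a copy of the quarantine file first.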

I also set up some cron jobs to run against the quarantine file: one deletes spam at 88% confidence and above every hour, one deletes spam at 58% and above after 3 days, and one deletes everything else once it is more than 14 days old.
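
Taken together, the three jobs amount to a tiered retention policy. Here is a rough sketch of that policy as a single function. The name should_delete is hypothetical, the confidence is assumed to be expressed in percent, and "after N days" is read as strictly older than N days, which may not match exactly how archivemail counts.

```python
def should_delete(confidence_pct, age_days):
    """Return True if a quarantined message falls into any of the three tiers."""
    if confidence_pct >= 88:
        return True              # near-certain spam: goes on the hourly run
    if confidence_pct >= 58:
        return age_days > 3      # probable spam: kept 3 days for review
    return age_days > 14         # everything else: expires after 14 days
```

For example, should_delete(60, 1) is False because a 60% message is still inside its 3-day review window, while should_delete(60, 4) is True.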

I then followed up on my corporate responsibility duties and submitted the patch back to the archivemail project on SourceForge. This didn't take me long, and it is worthwhile whether they accept it or not. At least the source is available online.

I hope this helps other dspam admins out there too!


1 lewser:

holiday42 said...

Is there a patch for archivemail-dspam that makes it look at the "from QUARANTINE day month dayofmonth time year" datestamp in a dspam.mbox file, instead of relying on the possibly future-dated or back-dated Date line in the mail header?