Showing posts with label python. Show all posts

6 January 2009

Starting simple with Python - CSV to SQL

Well, time to learn a new language, and what better than Python? Here's my first attempt at it, to solve a very simple admin task I had to do, which was to convert a CSV file into a series of SQL updates.

There's just a small bit of logic required: skipping over empty fields and sanitizing the input, so overall it's a really simple piece of code. I've added some stuff like command line arguments, but of course haven't written error checking or usage help. What is strange about Python is that it's really sensitive to what sort of indents you use. Tabs are not the same as spaces, so please be very careful, or else be wary of this infamous error:
IndentationError: unindent does not match any outer indentation level
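To see where that message comes from, here is a minimal sketch, assuming a Python 3 interpreter (the script in this post is Python 2). The helper name `indentation_problem` and the two sample snippets are made up for illustration. It compiles each snippet without running it and reports which indentation error, if any, the compiler raises.

```python
def indentation_problem(source):
    """Return (exception class name, message) if source has an indentation error, else None."""
    try:
        compile(source, "<example>", "exec")
    except IndentationError as err:  # TabError is a subclass of IndentationError
        return type(err).__name__, err.msg
    return None

# The last line dedents to 4 spaces, which matches neither 0 nor 8.
bad_dedent = "for x in [1, 2]:\n        y = x\n    z = x\n"

# One line is indented with a tab, the next with eight spaces.
mixed_tabs = "if True:\n\ty = 1\n        z = 2\n"

print(indentation_problem(bad_dedent))
print(indentation_problem(mixed_tabs))
```

The first snippet triggers the "unindent does not match" message quoted above. The second shows that Python 3 rejects a mix of tabs and spaces within a block outright.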
Here's the code, which I'm sure I'll be referring to in the near future.


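import sys
import csv

# Python 2 script: argv[1] is the CSV file, opened with universal newlines.
destinations = csv.reader( open( sys.argv[1], "rU" ) )

for data in destinations:
    # Skip rows that have no shipping method.
    if (data[1] != ""):
        # Escape single quotes in the country name.
        country = data[0].replace( "'", "\\'" )
        print "update Country set ShipMethod = '" + data[1] + "' where CountryName = '" + country + "'; "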
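For anyone trying this on Python 3, here is a rough sketch of the same idea, not a drop-in replacement. The function names `sql_quote` and `csv_to_updates` and the sample rows are made up for illustration. It also swaps the backslash escape for the standard SQL convention of doubling single quotes, and applies it to both fields, which the original script does not do.

```python
import csv
import io

def sql_quote(value):
    """Escape a value for a single-quoted SQL string by doubling single quotes."""
    return value.replace("'", "''")

def csv_to_updates(rows):
    """Yield one UPDATE statement per row that has a non-empty second field."""
    for row in rows:
        if len(row) < 2 or row[1] == "":
            continue  # nothing to set for this country
        yield "update Country set ShipMethod = '%s' where CountryName = '%s';" % (
            sql_quote(row[1]),
            sql_quote(row[0]),
        )

# Hypothetical sample data standing in for the real CSV file.
sample = io.StringIO("Malaysia,Air\nCote d'Ivoire,Sea\nIceland,\n")
statements = list(csv_to_updates(csv.reader(sample)))
for statement in statements:
    print(statement)
```

Which escape is right depends on the database. If the updates were run from a program instead of being pasted into a SQL console, parameterized queries would avoid the quoting question altogether.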


yk

18 December 2007

Making Archivemail work with DSpam

I've got a DSpam "appliance" that all the enterprise email filters through. I've set it up so that only one DSpam user is used to filter all the emails. This has worked well over the past few years, but managing it has been quite a chore. Every morning, I'd have to wade through the emails in the quarantine (about 15K) and free up any False Positives that were caught.

Anything at or above 58% spam confidence, as reported by DSpam, is pretty much always spam. Below that, between 47% and 57%, there may be one or two False Positives.

After freeing them up, deleting the remaining emails is a huge chore, because the DSpam UI will not allow deleting the quarantine file when new spam pops in.

So I needed a little program which would scan the quarantine mbox file and delete off any messages which are 58% or higher spam confidence.
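To make the requirement concrete, here is a minimal standalone sketch using Python 3's standard `mailbox` module. This is not the archivemail patch described below. The function names, the temporary file and the sample messages are all made up, and it assumes the `X-DSPAM-Confidence` header holds a fraction between 0 and 1, so 58% becomes 0.58.

```python
import mailbox
import os
import tempfile

def purge_high_confidence(mbox_path, threshold, header="X-DSPAM-Confidence"):
    """Delete messages whose confidence header is >= threshold; return how many were removed."""
    box = mailbox.mbox(mbox_path)
    removed = 0
    box.lock()  # keep other writers away while the file is rewritten
    try:
        for key in list(box.keys()):
            raw = box[key].get(header)
            try:
                confidence = float(raw) if raw else 0.0
            except ValueError:
                confidence = 0.0  # treat an unparsable header as "no confidence"
            if confidence >= threshold:
                box.remove(key)
                removed += 1
        box.flush()  # write the changes back to disk
    finally:
        box.unlock()
        box.close()
    return removed

def demo():
    """Build a throwaway two-message mbox, purge it, and return (removed, remaining)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "quarantine.mbox")
        box = mailbox.mbox(path)
        box.add("From: a@example.com\nX-DSPAM-Confidence: 0.9100\n\nspam\n")
        box.add("From: b@example.com\nX-DSPAM-Confidence: 0.5000\n\nmaybe\n")
        box.close()
        removed = purge_high_confidence(path, 0.58)
        box = mailbox.mbox(path)
        remaining = len(box)
        box.close()
        return removed, remaining

print(demo())
```

A real quarantine file is being appended to while this runs, so the locking matters. Try it on a copy of the file first.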

I tried the most obvious program, 'archivemail', which was readily available in all distros, but was disappointed that it only allowed filtering on the message's age. There was a mysterious "Filter" switch, but it only applied to IMAP mailboxes.

The great thing about archivemail is that, like the entire emailing stack on my servers, it's completely Free Software. I just had to invest some time to look at the code. archivemail lived in /usr/bin/. I had a look at the file, and it's a very small 1500-line Python script!

I hadn't programmed in Python before, but looking at the code, it didn't look too scary. It had classes, but no colons. Indentation seemed to be important here. I scanned the code and found the little function called "should_archive(message)". And sure enough, the crux of the logic which decides whether a message is to be archived away or not was there.

So I added these lines:
    if (options.spam_confidence > 0) \
            and (options.spam_confidence > get_spam_confidence(message)):
        return 0
I also modified the options class to include the spam_confidence field, changed the code that reads the command line options, and then wrote the section which extracts the spam confidence from the message headers. This was relatively easy, because the rest of the code basically does the same thing: reading values off the headers and using the information. So my new function looked like this:

def get_spam_confidence(message):
    """Returns the DSPAM_Confidence from the message headers. Zero by default"""
    """ 071218 yky Created """

    assert message is not None

    for header in ('X-DSPAM-Confidence', 'SPAM-Confidence'):
        confidence = message.get(header)
        if confidence:
            confidence_val = float( confidence )
            if confidence_val:
                vprint("Spam Confidence: %f " % confidence_val)
                return confidence_val

    return 0.0
That's it!
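The function above only runs inside archivemail, since it relies on the script's own vprint helper and options object. To try the logic on its own, here is a small Python 3 sketch under a few assumptions: the name `keep_in_mailbox` and the sample messages are invented, the threshold is assumed to use the same scale as the header value, and a malformed header is skipped instead of raising an error, which the patch does not do.

```python
import email

HEADERS = ("X-DSPAM-Confidence", "SPAM-Confidence")

def get_spam_confidence(message):
    """Return the first non-zero confidence found in the headers, or 0.0."""
    for header in HEADERS:
        raw = message.get(header)
        if not raw:
            continue
        try:
            value = float(raw)
        except ValueError:
            continue  # ignore headers that are not numbers
        if value:
            return value
    return 0.0

def keep_in_mailbox(message, threshold):
    """Mirror the added check: True when a threshold is set and the message scores below it."""
    return threshold > 0 and threshold > get_spam_confidence(message)

spammy = email.message_from_string("From: a@example.com\nX-DSPAM-Confidence: 0.8800\n\nbody\n")
plain = email.message_from_string("From: b@example.com\n\nbody\n")

print(get_spam_confidence(spammy), keep_in_mailbox(spammy, 0.58))
print(get_spam_confidence(plain), keep_in_mailbox(plain, 0.58))
```

A message with no confidence header scores 0.0, so it is always kept when a threshold is set.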

I also set up some cron jobs to run against the quarantine file: kill spam at 88% and above every hour, kill spam at 58% and above after 3 days, and kill the rest once it is more than 14 days old.

I then followed up on my corporate responsibility duties and submitted the patch back to the archivemail project on SourceForge. This didn't take me long, and it is worthwhile whether they accept it or not. At least the source is available online.

I hope this helps other dspam admins out there too!

yk.