This page describes how to make a simple and effective DIY ("Do-It-Yourself") spam filter using Python. (Or rather, one way to do it).
This documentation is fairly basic. If you find this information useful or interesting, please contact me. I am unlikely to update this unless someone asks me to!
This project arose out of frustration with existing spam filter solutions:
Then I discovered that Python comes with the email module, which can parse email messages. So I had a go at writing my own spam filter in Python.
And it was easy! The rest of this page describes how I did it. I don't expect you to copy my code - though you're welcome to. The real point of this page is to remind people that Python comes with batteries included - including batteries for parsing email messages. And frankly, writing a spam filter in Python was easier than deciphering the SpamAssassin or procmail documentation. And a lot more fun.
I make no grand claims about this spam filter. However, I believe it is easier to use than writing procmail rules. It does not achieve the success rates of the likes of SpamAssassin or DSpam, but it gives the user complete control over their spam filtering.
So if you're interested, feel free to use my DIY spam filter, which you can download here and which is documented below.
Alternatively, look up the Python email module, and have a go yourself!
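As a starting point, here is a minimal sketch of parsing a message with the email module. It is written for a current Python 3 rather than the Python 2 versions mentioned later on this page, and the message text and addresses are made up for illustration.

```python
import email
from email.utils import parseaddr

# A made-up message, purely for illustration.
raw = (
    "From: Alice Example <alice@example.com>\n"
    "To: pballard@example.org\n"
    "Subject: Chess on Saturday?\n"
    "\n"
    "Hi Peter, are you free for a game?\n"
)

# Parse the raw text into a Message object.
msg = email.message_from_string(raw)

# Headers are looked up like dictionary keys.
sender_name, sender_addr = parseaddr(msg["From"])
subject = msg["Subject"]

# For a simple (non-multipart) message, the payload is the body text.
body = msg.get_payload()

print(sender_addr)
print(subject)
print(body)
```

The sender address, subject and opening text of the body are the main things a filter like this one needs to look at.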
The program is divided into three parts:
To use it, you will need to:
First, a few words about my setup. I run a standalone PC with a Linux system, connected directly to my ISP. My username is "pballard".
To get email, I run fetchmail, which gets the mail from my ISP. I run fetchmail manually whenever I feel like getting my mail. (I realise some people like it to poll for mail continually). fetchmail puts the mail in /var/spool/mail/pballard . It then calls procmail, which processes this mail. Usually it is from within procmail that you call your spam filter.
I changed my setup so that procmail would send my email to a place which I could read and write without root privilege. I chose the file /home/pballard/sysmail/pballard.raw for this, so my ~/.procmailrc file is:
# send everything to one file
:0:
/home/pballard/sysmail/pballard.raw
I then created a simple program "fm", which runs fetchmail, then invokes my python email filter. Below is the code for fm:
#!/bin/sh
fetchmail
# give time for processing to finish
sleep 2
python /home/pballard/software/python/email_filter.py
I do not recommend calling the spam filter from procmail directly. procmail runs once for each message, so there can be multiple procmails running at once, resulting in multiple calls of the spam filter. This may result in file clashes. In any case, procmail is not needed because the filter does all the file redirection usually handled by procmail.
Using aggressive whitelisting, a functional spam filter is not hard to make.
A whitelist is a list of known emailers: email addresses which you know never send spam (though maybe the occasional virus!). By aggressive whitelisting, I mean that I ensure my whitelist is always up-to-date.
An initial whitelist can be created by simply searching all your mail folders.
The whitelist can then be kept up-to-date by separating ham from a whitelisted source ("wham") from other ham (see below).
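One way an initial whitelist might be gathered is sketched below. It is not taken from my program: it assumes Python 3, assumes the mail folders are stored in mbox format, and uses the standard mailbox module. The function name collect_senders is invented for this example.

```python
import mailbox
from email.utils import getaddresses


def collect_senders(mbox_paths):
    """Return the set of lower-cased From addresses found in the given mbox files.

    Assumption: every path names an existing folder in mbox format.
    """
    whitelist = set()
    for path in mbox_paths:
        # create=False: raise an error rather than create a missing file.
        for msg in mailbox.mbox(path, create=False):
            # A message may carry more than one address in its From header.
            for _name, addr in getaddresses(msg.get_all("From", [])):
                if addr:
                    whitelist.add(addr.lower())
    return whitelist
```

The result would still need a look over by hand, since any spam already sitting in those folders would put its senders on the list too.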
By using a whitelist, detecting other good emails ("mham") gets easier. To find a good email, you do not have to ask "What does good email look like?", but "What does good email look like when coming from a new sender?".
Wham = Whitelisted Ham. Email from a known sender (i.e. someone on my "whitelist") is known good.
Mham = Marked Ham. This is email from an unknown sender, but which contains a string indicating that it is probably good. For instance, on my web page and in my email signature, I ask people to include my first name (Peter) in the first line of a new email to me. I also have a few other strings I check for. For instance, as I'm a chess player, any email with "chess" near the start is probably ham, not spam.
Spam, of course, is unwanted email. If an email is not wham or mham, I check it for various characteristics which might indicate it is spam.
Ham traditionally means wanted email, but in my program it denotes undetermined: any email which is not wham or mham, and cannot be positively identified as spam, is called ham. This is sent to a separate folder from the wham and mham. In my experience, nearly all "ham" is actually spam. In other words, if an email is not from a known sender (wham) and does not contain clear indications that it is ham (mham), it's most likely spam.
There actually is a fifth category, called Panic. This is for emails which cannot be parsed by the Python function email.message_from_string(). I get a small number of these using Python 2.2, but they have, without fail, been spam. (Only the spammers are incapable of formatting an email message properly, it seems). I have never had a "panic" when using Python 2.4b.
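The five categories above can be sketched as a single function. This is an illustration written for Python 3, not the code from my program: the whitelist, the marker strings, the 200-character limit and the looks_like_spam() test are all placeholders invented for the example. Note also that the Python 3 parser is very lenient, so the panic branch may rarely, if ever, be reached there.

```python
import email
from email.utils import parseaddr

# Hypothetical example data; a real whitelist and marker list would be your own.
WHITELIST = {"alice@example.com"}
MARKERS = ("peter", "chess")


def looks_like_spam(msg):
    """Placeholder spam test (an assumption, not the test in email_spam.py)."""
    subject = str(msg.get("Subject") or "")
    return "viagra" in subject.lower()


def start_of_message(msg, limit=200):
    """Return the subject plus the start of the first text part, lower-cased."""
    body = ""
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            payload = part.get_payload()
            if isinstance(payload, str):
                body = payload
            break
    subject = str(msg.get("Subject") or "")
    return (subject + "\n" + body[:limit]).lower()


def classify(raw):
    """Sort one raw message into wham, mham, spam, ham or panic."""
    try:
        msg = email.message_from_string(raw)
    except Exception:
        return "panic"  # the message could not be parsed at all
    _name, addr = parseaddr(str(msg.get("From") or ""))
    if addr.lower() in WHITELIST:
        return "wham"   # known sender
    start = start_of_message(msg)
    if any(marker in start for marker in MARKERS):
        return "mham"   # unknown sender, but contains a marker string
    if looks_like_spam(msg):
        return "spam"   # positively identified as spam
    return "ham"        # undetermined
```

The order of the checks matters: the whitelist and marker tests come first, so a message from a known sender is never handed to the spam test.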
I have included my spam detection function in the file email_spam.py, but others may want to modify or enhance it. Here are a few hints: