DIY Spam Filtering Using Python

This page describes how to make a simple and effective DIY ("Do-It-Yourself") spam filter using Python. (Or rather, one way to do it).

This documentation is fairly basic. If you find this information useful or interesting, please contact me. I am unlikely to update this unless someone asks me to!

Introduction

This project arose out of frustration with existing spam filter solutions.

Then I discovered that Python comes with the email module, which can parse email messages. So I had a go at writing my own spam filter in Python.

And it was easy! The rest of this page describes how I did it. I don't expect you to copy my code - though you're welcome to. The real point of this page is to remind people that Python comes with batteries included - including batteries for parsing email messages. And frankly, writing a spam filter in Python was easier than deciphering the SpamAssassin or procmail documentation. And a lot more fun.

I make no grand claims about this spam filter. However, I believe it is easier to use than a set of hand-written procmail rules. It does not achieve the success rates of the likes of SpamAssassin or DSpam, but it gives the user complete control over their spam filtering.

So if you're interested, feel free to download and use my DIY spam filter, which is documented below.

Alternatively, look up the Python email module, and have a go yourself!
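To give a feel for how little code is involved, here is a minimal sketch of parsing a message with the standard email module. It assumes a current Python 3 rather than the Python 2.2 this page was written for, and the sample message and addresses are made up for illustration.

```python
import email
from email.utils import parseaddr

# A made-up message, purely for illustration.
RAW = """\
From: Alice Example <alice@example.com>
To: pballard@example.org
Subject: Chess on Friday?

Peter, are you free for a game on Friday?
"""

msg = email.message_from_string(RAW)

# Headers are looked up like dictionary keys (case-insensitively).
subject = msg["Subject"]

# parseaddr splits a header value into (display name, address).
name, address = parseaddr(msg["From"])

# For a simple, non-multipart message the payload is the body text.
body = msg.get_payload()

print(subject, address, sep=" | ")
```

Everything a simple filter needs (sender, subject, body) is available after that one call to email.message_from_string().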

Setting Up

The program is divided into three parts:

  1. the mail handler program (email_filter.py);
  2. the actual program which decides whether a message is spam or not (email_spam.py);
  3. local definitions (email_defs.py).

To use it, you will need to:

  1. Have Python 2.2 or greater (Python 2.1 might work, but I haven't tried).
  2. Download the source code.
  3. Edit email_defs.py to suit your local environment.
  4. Move the sources to a place in your Python path.
  5. Modify your environment so that email goes through the filter. How I did it is described below.

Setting up so that email goes through the filter

First, a few words about my setup. I run a standalone PC with a Linux system, connected directly to my ISP. My username is "pballard".

To get email, I run fetchmail, which gets the mail from my ISP. I run fetchmail manually whenever I feel like getting my mail. (I realise some people like it to poll for mail continually.) fetchmail puts the mail in /var/spool/mail/pballard. It then calls procmail, which processes this mail. Usually it is from within procmail that you call your spam filter.

I changed my setup so that procmail would send my email to a place which I could read and write without root privilege. I decided that place would be the file /home/pballard/sysmail/pballard.raw. So this is my ~/.procmailrc file:


# send everything to one file
:0:
/home/pballard/sysmail/pballard.raw

I then created a simple program "fm", which runs fetchmail, then invokes my Python email filter. Below is the code for fm:


#!/bin/sh
fetchmail
# give time for processing to finish
sleep 2
python /home/pballard/software/python/email_filter.py

I do not recommend calling the spam filter from procmail directly. procmail runs once for each message, so there can be multiple procmails running at once, resulting in multiple calls of the spam filter. This may result in file clashes. In any case, procmail is not needed because the filter does all the file redirection usually handled by procmail.
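Once procmail has dumped everything into one file, the filter has to read that file one message at a time. The following is a minimal sketch of that step, not the code from email_filter.py. It assumes the raw file is in mbox format (which is what a plain procmail delivery recipe normally produces) and uses the standard mailbox module under Python 3. The function name read_raw_mail and the demo messages are hypothetical.

```python
import mailbox
import os
import tempfile


def read_raw_mail(path):
    """Return a list of the messages stored in the mbox file at `path`."""
    # create=False: fail loudly if the file is missing instead of creating it.
    box = mailbox.mbox(path, create=False)
    try:
        return list(box)
    finally:
        box.close()


def demo():
    """Write two made-up messages to a temporary mbox and read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "pballard.raw")
        box = mailbox.mbox(path)
        box.add("From: alice@example.com\nSubject: first\n\nHello\n")
        box.add("From: bob@example.com\nSubject: second\n\nWorld\n")
        box.close()
        return [message["Subject"] for message in read_raw_mail(path)]


print(demo())
```

Because the filter runs once, after fetchmail and procmail have finished, a single reader like this never competes with another copy of itself for the file.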

How the Program Works

Aggressive Whitelisting

Using aggressive whitelisting, a functional spam filter is not hard to make.

A whitelist is a list of known senders: email addresses which you know never send spam (though maybe the occasional virus!). By aggressive whitelisting, I mean I ensure that my whitelist is always up-to-date.

An initial whitelist can be created by simply searching all your mail folders.
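As a rough illustration of that search, the sketch below collects the From addresses of every message in a set of mail folders. It is not the author's code. It assumes Python 3, that the folders are mbox files, and that only the From header is harvested. The names sender_addresses and build_whitelist, and the demo data, are hypothetical.

```python
import mailbox
import os
import tempfile
from email.utils import getaddresses


def sender_addresses(message):
    """Return the lower-cased addresses found in a message's From header(s)."""
    values = [str(value) for value in message.get_all("From", [])]
    return {addr.lower() for _name, addr in getaddresses(values) if addr}


def build_whitelist(mbox_paths):
    """Collect the sender addresses of every message in the given mbox files."""
    whitelist = set()
    for path in mbox_paths:
        box = mailbox.mbox(path, create=False)
        try:
            for message in box:
                whitelist |= sender_addresses(message)
        finally:
            box.close()
    return whitelist


def demo():
    """Build a whitelist from a temporary folder of made-up messages."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "friends")
        box = mailbox.mbox(path)
        box.add("From: Alice <Alice@Example.com>\nSubject: one\n\nhi\n")
        box.add("From: bob@example.com\nSubject: two\n\nhi\n")
        box.add("From: alice@example.com\nSubject: three\n\nhi\n")
        box.close()
        return build_whitelist([path])


print(sorted(demo()))
```

Lower-casing the addresses means that Alice@Example.com and alice@example.com count as the same sender.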

The whitelist can then be kept up-to-date by separating ham from a whitelisted source ("wham") from other ham (see below).

By using a whitelist, detecting other good emails ("mham") gets easier. To find a good email, you do not have to ask "What does good email look like?", but "What does good email look like when coming from a new sender?".

Potential Problems with Whitelisting

Ham, Spam, Wham and Mham

The key to the filter is to put each email into one of four (actually five) different classifications:

Wham = Whitelisted Ham. Email from a known sender (i.e. someone on my "whitelist") is known good.

Mham = Marked Ham. This is email from an unknown sender, but which contains a string indicating that it is probably good. For instance, on my web page and in my email signature, I ask people to include my first name (Peter) in the first line of a new email to me. I also have a few other strings I check for. For instance, as I'm a chess player, any email with "chess" near the start is probably ham not spam.
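A marker check of this kind might look like the sketch below. It is an illustration under stated assumptions, not the test used in email_spam.py: it assumes Python 3, treats "near the start" as the subject plus the first 200 characters of the plain-text body, and matches whole words only. The names is_mham and body_text and the 200-character window are hypothetical.

```python
import email
import re

# Markers taken from the examples in the text. The 200-character window
# used for "near the start" is an assumption made for this sketch.
FIRST_LINE_MARKERS = (re.compile(r"\bpeter\b", re.IGNORECASE),)
NEAR_START_MARKERS = (re.compile(r"\bchess\b", re.IGNORECASE),)
NEAR_START_CHARS = 200


def body_text(msg):
    """Return the text of the first text/plain part, or '' if there is none."""
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            return payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset name: fall back to UTF-8.
            return payload.decode("utf-8", errors="replace")
    return ""


def is_mham(msg):
    """Return True if the message carries one of the 'probably good' markers."""
    text = body_text(msg)
    first_line = next((ln for ln in text.splitlines() if ln.strip()), "")
    if any(p.search(first_line) for p in FIRST_LINE_MARKERS):
        return True
    start = str(msg.get("Subject", "")) + "\n" + text[:NEAR_START_CHARS]
    return any(p.search(start) for p in NEAR_START_MARKERS)


greeting = email.message_from_string(
    "From: new@example.net\nSubject: hello\n\nHi Peter,\nlong time no see.\n"
)
print(is_mham(greeting))
```

Matching on whole words keeps a name like "Peterson" from counting as a marker, at the cost of missing unusual spellings.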

Spam, of course, is unwanted email. If an email is not wham or mham, I check it for various characteristics which might indicate it is spam.

Ham traditionally means wanted email, but in my program it denotes undetermined: any email which is not wham or mham, and cannot be positively identified as spam, is called ham. This is sent to a separate folder from the wham and mham. In my experience, nearly all "ham" is actually spam. In other words, if an email is not from a known sender (wham) and does not contain clear indications that it is good (mham), it is most likely spam.

There actually is a fifth category, called Panic. This is for emails which cannot be parsed by email.message_from_string() from the Python email module. I get a small number of these using Python 2.2, but they have, without fail, been spam. (Only the spammers are incapable of formatting an email message properly, it seems). I have never had a "panic" when using Python 2.4b.
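The order in which the five categories are tried can be sketched as follows. This is an illustration, not the code from email_spam.py. It assumes Python 3, where the parser seldom raises an exception and instead records problems in the message's defects list. Treating any recorded defect as a "panic" is an assumption of this sketch. The is_mham and is_spam arguments are hypothetical callables supplied by the caller.

```python
import email
from email.utils import parseaddr


def classify(raw_text, whitelist, is_mham, is_spam):
    """Return 'panic', 'wham', 'mham', 'spam' or 'ham' for one raw message."""
    try:
        msg = email.message_from_string(raw_text)
    except Exception:
        # Older parsers raised on malformed mail.
        return "panic"
    if msg.defects:
        # Assumption: any parse defect on the top-level message is a panic.
        return "panic"
    sender = parseaddr(str(msg.get("From", "")))[1].lower()
    if sender in whitelist:
        return "wham"
    if is_mham(msg):
        return "mham"
    if is_spam(msg):
        return "spam"
    # Not whitelisted, not marked, not positively spam: undetermined.
    return "ham"


def never(msg):
    """Placeholder predicate that matches nothing."""
    return False


whitelist = {"alice@example.com"}
good = "From: Alice <alice@example.com>\nSubject: hi\n\nhello\n"
broken = "From: x@example.net\nContent-Type: multipart/mixed\n\nno boundary\n"
print(classify(good, whitelist, never, never))
print(classify(broken, whitelist, never, never))
```

The whitelist is checked first, so a message from a known sender is never examined for spam characteristics at all.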

Data Flow

  1. Identify email as either panic, wham, mham, spam or ham.
  2. Have one folder for each, and send email to the appropriate folder.
  3. In addition, all emails are also sent to a storage file (storefile). This is useful for testing, but also for retrieving the small number of "false positives", i.e. good emails which are incorrectly identified as spam. Once this file gets up near 600MB, I write it to a CDROM and then delete it.
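The delivery steps above can be sketched as follows. This is a minimal illustration, not the code from email_filter.py. It assumes Python 3, that each folder and the storefile are mbox files, and that messages are handled as raw bytes so that even "panic" messages can be filed without being parsed. The names deliver and count_messages and the file layout are hypothetical.

```python
import mailbox
import os
import tempfile

CATEGORIES = ("panic", "wham", "mham", "spam", "ham")


def deliver(raw_bytes, category, folder_dir, storefile):
    """Append a message to its category folder and to the storage file."""
    if category not in CATEGORIES:
        raise ValueError("unknown category: %r" % (category,))
    for path in (os.path.join(folder_dir, category), storefile):
        box = mailbox.mbox(path)  # created on first use
        try:
            box.add(raw_bytes)
        finally:
            box.close()


def count_messages(path):
    """Return the number of messages in the mbox file at `path`."""
    box = mailbox.mbox(path, create=False)
    try:
        return len(box)
    finally:
        box.close()


def demo():
    """File two made-up messages and report how many landed where."""
    with tempfile.TemporaryDirectory() as tmp:
        store = os.path.join(tmp, "storefile")
        deliver(b"From: alice@example.com\nSubject: hi\n\nhello\n",
                "wham", tmp, store)
        deliver(b"From: x@example.net\nSubject: pills\n\nbuy now\n",
                "spam", tmp, store)
        return {
            "wham": count_messages(os.path.join(tmp, "wham")),
            "spam": count_messages(os.path.join(tmp, "spam")),
            "storefile": count_messages(store),
        }


print(demo())
```

Writing every message to the storefile as well as to its folder is what makes a false positive recoverable later.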

A few hints on Spam Detection

I have included my spam detection function in the file email_spam.py, but others may want to modify or enhance it. Here are a few hints:



Contact Details