Technology on my desktop: spam or ham?
A new approach to dealing with spam in your inbox.
Do you enjoy spam? I don't mean the canned meat, written with a capital letter, but the type that offers you a fortune in Nigeria, a genuine imitation Rolex, a night of bliss with your favourite Miss or even the latest gadget from a company you bought a gizmo from last decade.
Let's face it, it's pestilential. Various companies have come up with schemes to filter out the spam before you even see it, with varying degrees of success. Unfortunately, the spammers always seem to be one step ahead of the anti-spammers, whatever the defence: blacklists, whitelists, Bayesian filters, requests for confirmation (whoever thought that one up?) and so on.
The surest way of not being bothered with spam is either not having an email address or, if you do, never looking at your inbox. That doesn't stop snail-mail spam or, worse, phone/SMS/MMS spam, but one cannot live entirely in a cocoon.
Would you believe it if I told you that various misbegotten persons send me an average of about 800 messages per day, vaunting their pirated software, pump and dump scams, their little blue pills or whatever? Of course!
Would you believe it if I said that it does not particularly bother me? Somewhat less credible, but it's true.
There is even greater news, although I'll keep that secret until a little later in this article. Let me say, just to whet your appetite, that I have a spam filter that is well over 99.9 per cent accurate, false positives and false negatives included. I make, on average, one correction every two weeks and never lose a wanted message (scanning through my spam mailbox for the odd one takes me only about 30 seconds per day).
The first thing to do is to change your email address whenever the spam becomes too obtrusive, usually after a year or so.
Of course, you have to tell your contacts to change their address book entry. To avoid this being dumped as spam, use the Bcc feature for a mass mailing (see box). This is very effective, especially if you keep the old and new addresses for incoming mail active for a month, with a reminder in the signature line for outgoing mail. The last time I did this, everyone complied.
After a month, you should configure your Internet Service Provider's mailbox system to dump all mail directed to your old address. Never bounce to spammers, just leave them in limbo! Use different addresses when you subscribe to netlists so that you can see which ones provoke spam (Yahoo Groups are notorious for this!) and change the bad'uns from time to time.
Now let's talk Bayesian. An 18th-century clergyman, the Rev Thomas Bayes, improbably came up with a mathematical theory on the probability of events. Little did he know that spam would make him famous three centuries after he was born.
Most email clients incorporate spam filters in their software and these are usually partially Bayesian, but they are not brilliant at their job. You can buy more complex Bayesian filters, but they are not usually much better.
A Bayesian spam filter builds a list of the words it encounters, called a corpus, and from each word's frequency of occurrence calculates a probability that a message containing it is spam or ham. On reading an email, it combines the probabilities of the words it recognises from the corpus into an overall probability and deals with the message accordingly. The question is where it should pick up the words it parses, and which ones it should ignore. Usually, it takes them from the To, From and Subject lines, although the better filters may also use the Body, if it is in plain language and not in a graphics format. Very few glean data from the full Header, yet this is often invaluable because of the routing and various domains that are found in it.
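For the curious, the word-counting idea can be sketched in a few lines of Python. This is only a minimal illustration of a two-way (spam or ham) naive Bayes filter, not the code of any real product: the class name TinyBayes, the tokenizer and the add-one smoothing are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class TinyBayes:
    """Hypothetical two-class word-frequency filter (illustration only)."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.message_counts = {label: 0 for label in self.LABELS}

    def train(self, text, label):
        """Add the words of one hand-classified message to the corpus."""
        self.word_counts[label].update(tokenize(text))
        self.message_counts[label] += 1

    def spam_probability(self, text):
        """Return the estimated probability (0 to 1) that text is spam."""
        if min(self.message_counts.values()) == 0:
            raise ValueError("train at least one spam and one ham message first")
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_score = {}
        for label in self.LABELS:
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Start from how common this class of message is overall.
            score = math.log(self.message_counts[label] / total_messages)
            for word in tokenize(text):
                if word not in vocab:
                    continue  # words never seen before carry no evidence
                # Add-one smoothing (an assumption) avoids zero probabilities.
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            log_score[label] = score
        # Turn the two log scores back into a probability for spam.
        top = max(log_score.values())
        weight = {label: math.exp(s - top) for label, s in log_score.items()}
        return weight["spam"] / (weight["spam"] + weight["ham"])


demo = TinyBayes()
demo.train("cheap rolex pills", "spam")
demo.train("meeting agenda for the project", "ham")
```

With this toy training, demo.spam_probability("cheap pills") comes out above 0.5 and demo.spam_probability("project meeting") below it. A real filter differs mainly in how much of the message it tokenizes and how large its corpus has grown.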
If the body in plain language and HTML, including graphics, is parsed as well as the complete header, then it is possible that we may have sufficient information to do a lot more than decide which words are likely to be spam or ham. This was an idea that John Graham-Cumming, currently a researcher at Electric Cloud Inc, had a number of years ago. He thought that emails could be tagged into, for example, spam, personal, work, hobbies and other 'buckets', which would be used for sending the messages into ad hoc mailboxes. This would make the corpus more complex, turning it into a two-dimensional array.
Boxes and buckets
This kind of corpus may be well developed after a few hundred messages have been evaluated. Each row holds one word's probabilities across the buckets, so the rows always add up to unity. Each time a correction is made, the probability of every word in that message is re-evaluated for each bucket.
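A two-dimensional corpus of this kind might be sketched as follows. This follows the description given here rather than POPFile's actual internals: the class BucketCorpus, its method names and the add-one smoothing are hypothetical, chosen only to show rows that sum to one and a correction that shifts a word's weight from one bucket to another.

```python
from collections import defaultdict

BUCKETS = ["spam", "personal", "work", "hobbies", "others"]


class BucketCorpus:
    """Hypothetical word-by-bucket table (illustration only)."""

    def __init__(self, buckets):
        self.buckets = list(buckets)
        # One row of counts per word, one column per bucket.
        self.counts = defaultdict(lambda: {b: 0 for b in self.buckets})

    def add(self, words, bucket):
        """Record the words of a message filed in the given bucket."""
        for word in words:
            self.counts[word][bucket] += 1

    def reclassify(self, words, wrong, right):
        """Apply a user correction: move the words from one bucket to another."""
        for word in words:
            if self.counts[word][wrong] > 0:
                self.counts[word][wrong] -= 1
            self.counts[word][right] += 1

    def row(self, word):
        """Return the word's probabilities per bucket; they sum to one."""
        counts = self.counts.get(word, {b: 0 for b in self.buckets})
        # Add-one smoothing (an assumption) keeps every bucket above zero.
        total = sum(counts.values()) + len(self.buckets)
        return {b: (counts[b] + 1) / total for b in self.buckets}


corpus = BucketCorpus(BUCKETS)
corpus.add(["steak", "camera"], "hobbies")
before = corpus.row("steak")
corpus.reclassify(["steak"], wrong="hobbies", right="personal")
after = corpus.row("steak")
```

Before the correction, "steak" leans towards hobbies; afterwards it leans towards personal, and in both cases the row still adds up to one.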
Using the above very fictitious micro-corpus, let's imagine I receive an equally fictitious email from the IET stating that I snapped a complete steak with my camera. The scores would be: spam, 0.111266; personal, 1.657527; work, 1.179555; hobbies, 0.923703; others, 0.127949. It would therefore decide that the message is most likely personal, with a probability of 41.43 per cent, against 29.49 per cent for the work bucket. In reality, with a complete corpus and analysis of Header, To, From, Subject and Body, the probability would normally exceed 90 per cent; below an optional threshold value, the message may be left unclassified.
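The arithmetic behind those percentages can be checked with a short Python sketch. The five scores are the ones quoted above; they add up to 4, which is consistent with four recognised words whose rows each sum to one, although that reading is an inference. The function name classify and its threshold parameter are hypothetical.

```python
# Bucket scores quoted in the text for the fictitious IET email.
scores = {
    "spam": 0.111266,
    "personal": 1.657527,
    "work": 1.179555,
    "hobbies": 0.923703,
    "others": 0.127949,
}


def classify(bucket_scores, threshold=0.0):
    """Normalise the scores to shares of the total and pick the largest.

    If the winning share is below threshold, report "unclassified".
    """
    total = sum(bucket_scores.values())
    shares = {bucket: s / total for bucket, s in bucket_scores.items()}
    best = max(shares, key=shares.get)
    if shares[best] < threshold:
        return "unclassified", shares
    return best, shares


bucket, shares = classify(scores)
strict_bucket, _ = classify(scores, threshold=0.9)
```

Here bucket is "personal" with a share of about 0.4144, and work follows at about 0.2949. Rounding gives 41.44 per cent, so the 41.43 quoted above appears to be truncated rather than rounded. With a threshold of 0.9, the same message would be left unclassified.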
Does it work well in practice? The answer is an unqualified yes. I subscribe to a help netlist and to a beta testing team concerning the same software, with similar vocabulary and some correspondents in common. It distinguishes accurately which messages go into my netlist bucket and which go into the beta bucket.
Incidentally, I have ten buckets, which my email client (Mozilla Thunderbird) sorts into mailboxes using an invisible X- classification header, but there is no reason why you cannot have anything between two and about 50 buckets. Over time, I have developed 157,988 separate words in my corpus, including tags, domains and gobbledegook like C1AL1S. The overall percentage accuracy for all ten buckets is 99.78 for the last 16,387 messages (I reset the statistics just a month or so ago), of which two were spam false positives and one a spam false negative, giving a spam accuracy of 99.982 per cent. I bet there aren't many utilities that can beat that!
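Sorting on such a header is something any mail client filter does for you, but the principle can be sketched in Python with the standard email module. The header name X-Text-Classification used below is an assumption about what the classifier inserts, so check the headers of one of your own messages; the function bucket_of and the sample message are purely illustrative.

```python
from email import message_from_string

# Assumed name of the classification header; verify against a real message.
HEADER = "X-Text-Classification"


def bucket_of(raw_message, default="unclassified"):
    """Return the bucket named in the classification header, if present."""
    message = message_from_string(raw_message)
    value = message.get(HEADER)
    return value.strip().lower() if value else default


tagged = (
    "From: someone@example.org\n"
    "Subject: Genuine imitation Rolex\n"
    "X-Text-Classification: spam\n"
    "\n"
    "Buy now!\n"
)
untagged = "From: someone@example.org\nSubject: Hello\n\nJust a note.\n"

tagged_bucket = bucket_of(tagged)
untagged_bucket = bucket_of(untagged)
```

A mail client rule does the same job: it reads the header value and moves the message to the mailbox of that name.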
Nothing in this world is perfect, least of all software. The main disadvantage is common to all Bayesian spam filters: it has to learn from its mistakes. When you first use it, all the messages are unclassified and you have to put each of them manually into one of your buckets. After about 100 emails, it will be sorting with 85 to 90 per cent accuracy, so you have to classify only about one-tenth of your messages. By 1,000 messages, your accuracy should be higher than 99 per cent and it will continue to increase from there, provided you sort the 'falsies' out conscientiously.
So, what is this wonder program and how much does it cost? Called POPFile, it doesn't cost you even a bent kopeck, because it is open-source for Windows, Mac and Linux. You can find it at http://getpopfile.org/.
It has been several years in the making by a very dedicated team of developers, through hundreds of iterations. Only recently have the developers considered it to have reached maturity with version number 1.0.0, although it was already winning awards in 2003.
I emphasise that my involvement is only as a very happy bunny (user) for the past five years.