Software I've found that I think is particularly useful or interesting.

Use any software mentioned here at your own risk.


8/01/2004: SpamBayes

    In 2003, my work e-mail address somehow got onto spammers' lists and I started getting junk e-mail for Viagra, home mortgages at 0% interest, and all sorts of other stuff. At first, it was easy enough to delete the one or two e-mails a day I would get. But after a couple of months the junk mail outnumbered my regular work mail, and I was spending too much of my time hunting for the legitimate messages from co-workers and clients. Then someone I work with told me about SpamBayes, and my life got a whole lot easier.

    SpamBayes has, at its heart, a program that uses statistical analysis techniques, primarily a version of naive Bayesian classification, to attempt to classify documents (e-mail messages) into a set number of categories. In this case, the categories are "spam", "not spam", and "maybe spam". The interesting thing about Bayesian classification (also known as "Bayesian learning") is that it requires, and benefits from, being trained. The "naive" part of the name refers to a simplifying assumption, namely that each word in a message counts as independent evidence, but the classifier is naive in the everyday sense too: there are no built-in rules that tell it anything about the source data (in this case, e-mail). So rather than have you (or some company) build a huge list of keywords to filter on (like "Viagra" and "V1@gr@"), with SpamBayes you "teach" it how to tell junk mail from legitimate mail. For every good and bad e-mail it trains on, it analyzes the different words (technically string tokens, which include things like HTML markup and header data) in the message and updates its internal statistics to help it do better next time. What this means, however, is that the junk mail and the non-junk mail you receive need to have some degree of divergence in the types of words and other data that they contain. Fortunately, for now, most junk e-mail contains a vastly different vocabulary from what you might get in e-mails from work or friends, so there's a good chance that a Bayesian classifier like SpamBayes will work well for you.
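
    To make the train-then-classify idea concrete, here is a minimal Python sketch of a textbook naive Bayes filter. It is not SpamBayes's actual code, and SpamBayes combines its word statistics in a more refined way than this; the class name TinyBayes, the whitespace tokenizer, and the add-one smoothing are all my own assumptions for illustration.

```python
import math
import re
from collections import Counter


class TinyBayes:
    """Toy two-class naive Bayes filter. Hypothetical; not SpamBayes's code."""

    def __init__(self):
        # How many trained messages of each kind contained each token.
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        # How many messages of each kind we have trained on.
        self.message_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Crude tokenizer (an assumption): lowercased, whitespace-separated
        # strings, so "V1@gr@" or an HTML tag is a token like any other word.
        return set(re.findall(r"\S+", text.lower()))

    def train(self, text, label):
        """Learn from one message; label is "spam" or "ham"."""
        self.message_counts[label] += 1
        self.token_counts[label].update(self.tokenize(text))

    def _log_likelihood(self, tokens, label):
        n = self.message_counts[label]
        total = sum(self.message_counts.values())
        # Add-one smoothing keeps unseen tokens from zeroing the product.
        score = math.log((n + 1) / (total + 2))
        for token in tokens:
            score += math.log((self.token_counts[label][token] + 1) / (n + 2))
        return score

    def spam_score(self, text):
        """Return the estimated probability (0.0 to 1.0) that text is spam."""
        tokens = self.tokenize(text)
        diff = (self._log_likelihood(tokens, "ham")
                - self._log_likelihood(tokens, "spam"))
        if diff > 700:  # guard against overflow in math.exp
            return 0.0
        return 1.0 / (1.0 + math.exp(diff))


# Train on a few made-up messages, then score two new ones.
f = TinyBayes()
f.train("cheap viagra mortgage offer", "spam")
f.train("viagra at 0% interest", "spam")
f.train("meeting notes from the client", "ham")
f.train("lunch with the client tomorrow", "ham")
print(f.spam_score("cheap viagra"))    # well above 0.5
print(f.spam_score("client meeting"))  # well below 0.5
```

    Notice that nothing in the code knows what "viagra" means; the score depends only on which tokens showed up in the junk and non-junk messages it was trained on, which is why the two vocabularies need to differ.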

    I use the Microsoft Outlook plug-in version of SpamBayes, which was easy to install and use and should also work with Outlook Express. There is also a way to install SpamBayes between a POP3 client and your POP3 e-mail server, allowing you to use it with other e-mail clients like Eudora, Netscape/Mozilla, Popcorn, etc., though I haven't tried this myself. For the Outlook plug-in, I just downloaded and installed the file provided for Windows. During installation, SpamBayes wanted to know which Outlook folder I wanted it to watch (I chose my "inbox", where all my incoming mail goes), which folder to use to store junk mail ("spam"), and which folder to use to store "maybe spam" mail ("spam maybe").

    It then wanted to know if I had particular mail folders that already had junk mail and non-junk mail (that I had separated out myself) to train on. Fortunately, I had been keeping junk mail in a separate folder (rather than just deleting it), so I had a cache of about 1000 junk mail messages (and a couple thousand non-junk messages) for SpamBayes to train on. If you don't have a collection of junk messages already, that's fine; SpamBayes will learn as new mail comes in and you mark it as junk or non-junk.

    It took some time to train on the mail I already had, and then the next time I opened Outlook, SpamBayes had added a new toolbar to the top of my screen (see screenshot above). When I'm in my "inbox", it gives me a button labeled "Delete as spam". When I'm in the "spam" folder, it changes to "Recover from spam". And when I'm in my "spam maybe" folder, I get both buttons. The buttons are how you move mail between the spam and non-spam folders while telling SpamBayes how it should have classified the e-mail (which, in turn, allows it to train itself on the message).

    As you can see from the screenshot above, I've also added a column to the "spam" folder display to show the "spam score" for each message (the higher the number, the more likely it is junk mail). In the SpamBayes settings, you can specify the thresholds for each classification - what "spam score" should go into your inbox and what score should go into your "spam" folder; everything else goes into the "spam maybe" folder. (I've got mine set at the defaults: 15% or lower = not junk, 90% or higher = junk, and anything in between goes into the "spam maybe" folder.) Once SpamBayes is installed, it will watch for incoming messages to the folder(s) you specify and attempt to classify them, moving them to the appropriate folder (e.g. moving messages it thinks are junk mail into the "spam" folder). If it classifies something incorrectly, you just select one or more messages you want to re-classify (e.g. select the 5 junk-mail messages in your inbox) and click the appropriate button; SpamBayes uses the messages to train itself and do better next time. Especially at first, you should keep a close eye on your "spam" folder, so that anything in there that shouldn't be can be re-classified as non-junk; over time, you should find fewer and fewer of these (I don't even check my "spam" folder anymore, though there's always a risk that something got in there that shouldn't have).
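
    The threshold rule is simple enough to write out. Here is a small Python sketch of it, using the 15% and 90% cutoffs and my three folder names; the function name choose_folder and its parameters are made up for illustration and are not taken from SpamBayes itself.

```python
def choose_folder(spam_score, ham_cutoff=15.0, spam_cutoff=90.0):
    """Map a spam score (0 to 100) to one of the three folders.

    Hypothetical helper: the cutoffs mirror the settings described above.
    """
    if not 0.0 <= ham_cutoff < spam_cutoff <= 100.0:
        raise ValueError("need 0 <= ham_cutoff < spam_cutoff <= 100")
    if spam_score >= spam_cutoff:
        return "spam"        # 90% or higher = junk
    if spam_score <= ham_cutoff:
        return "inbox"       # 15% or lower = not junk
    return "spam maybe"      # everything in between


for score in (3, 15, 42, 90, 99.7):
    print(score, "->", choose_folder(score))
```

    Raising the lower cutoff lets more mail straight into the inbox, and lowering the upper one sends more straight to "spam"; either way the "spam maybe" folder shrinks, at the cost of more mistakes you never get a chance to review.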

    I currently get between 150 and 200 junk e-mail messages a day at work, but I don't see any of them; it has been months since one showed up in my inbox. And the number of "false positives" I get (when SpamBayes classifies a non-junk item as junk mail) is very small; maybe one or two a week. These always show up in my "spam maybe" folder; whenever I see that there are new messages in there, I take a quick look and use the SpamBayes buttons to put them in the right place. If you have a problem with junk mail and you use Outlook or Outlook Express as your e-mail client, SpamBayes is an easy and free way to try to get it under control. It doesn't stop the junk mail from getting to your computer, but it does keep you from seeing it and keeps it from getting in the way of your regular mail. I do wish that it had a whitelist/blacklist feature, however, so that I could expressly mark certain senders as always (or never) junk mail. When I e-mail myself a quick note from home, for example, the brevity of the message often causes SpamBayes to toss it into the "spam maybe" folder. But without SpamBayes, my only alternative at work would have been to have the network manager change my e-mail address, which would have wreaked havoc with all of the clients I correspond with.

©2017 Tyler Chambers