While speaking of spam
I have to say that WP-Hashcash has done the trick. I’ve gotten fewer than 2 dozen spams in the last 2 days, and the log shows almost 1500 blocked attempts… sweet!
So while on the topic of spam, Scotty asked me today about fighting e-mail spam more effectively. Even though I’ve used SpamAssassin for a couple years, I would still get a few hundred spam e-mails a week that made it through the filter. Thunderbird would collect about 90% of those, which meant that I didn’t have to look at a lot of spam… but still far more than I would have liked. Plus, any time I looked at my mail from a webmail interface, I had to sort through a few dozen spams.
I read up a couple months ago on SpamAssassin tweaking. Let me summarize some steps I took that caused the number of spam e-mails I get a week to fall to under 10.
1) Training Camp
The quick explanation of how SpamAssassin works: it looks for various signs of a spam e-mail (the word “Viagra”, e-mail sent to a large number of recipients, spoofed headers, etc.) and, for each one it finds, adds a varying number of “points” to the e-mail based on how spam-like the behavior is. If an e-mail gets too many points, it’s spam! Most SA installations learn by default what “strong spam” and “strong ham” are (those with very high positive scores and those with negative scores, respectively) and thus are able to adapt to your own personal spam preferences. However, it’s the remaining e-mails in between, the “weak spam”, with which you need to lend SpamAssassin a hand.
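To make the points idea concrete, here is a toy sketch of additive scoring against a threshold. It is not SpamAssassin's actual code; the function names `total_score` and `is_spam` are made up for illustration, and the point values you pass in are whatever you want to experiment with.

```shell
#!/bin/sh
# Toy model of additive spam scoring (hypothetical helpers, not part of SpamAssassin).

# total_score POINTS...  -> prints the sum of the given points, one decimal place
total_score() {
    printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.1f\n", sum }'
}

# is_spam THRESHOLD POINTS...  -> exit status 0 if the summed points reach THRESHOLD
is_spam() {
    threshold=$1
    shift
    total=$(total_score "$@")
    awk -v t="$total" -v r="$threshold" 'BEGIN { exit !(t >= r) }'
}

# Example (made-up points):
#   total_score 2.5 1.9 0.5               # sums the three values
#   is_spam 5.0 2.5 1.9 0.5 && echo spam  # prints nothing, the total is below 5.0
```

Each rule that matches contributes its points, and the message is only flagged once the total reaches the threshold.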
The sa-learn command allows you to show SpamAssassin additional examples of spam it didn’t catch the first time around, thus improving its Bayesian filtering. I use this small shell script to train it every so often:
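#!/bin/sh
# Train SpamAssassin's Bayesian filter on known-good and known-bad mail.
echo "Learning from Ham..."
# The inbox holds mail I kept, so it counts as ham.
sa-learn --ham --progress --mbox ~/mail/seth/inbox
echo "Learning from Spam..."
# Thunderbird's Junk folder holds spam that slipped past SpamAssassin.
sa-learn --spam --progress --mbox ~/mail/seth/Junk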
You need to train equal amounts of ham and spam for optimum results, according to the manual. So I train SA for ham on my inbox, and spam on Thunderbird’s Junk folder.
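A quick way to see how balanced the two folders are before training is to count the messages in each. The sketch below assumes the folders are traditional mbox files, where every message starts with a line beginning with "From "; the helper names `count_mbox_messages` and `balance_report` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for comparing the sizes of two mbox folders.

# count_mbox_messages FILE -> prints the number of messages in an mbox file
count_mbox_messages() {
    # grep -c exits non-zero when the count is 0, so ignore its status
    grep -c '^From ' "$1" || true
}

# balance_report HAM_MBOX SPAM_MBOX -> prints both counts on one line
balance_report() {
    ham=$(count_mbox_messages "$1")
    spam=$(count_mbox_messages "$2")
    echo "ham=$ham spam=$spam"
}

# Example:
#   balance_report ~/mail/seth/inbox ~/mail/seth/Junk
```

If the counts are wildly lopsided, it may be worth training on a subset of the larger folder. Running `sa-learn --dump magic` should also report how many spam and ham messages the Bayes database has learned so far.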
2) You make the rules
Most e-mail programs let you view the full e-mail headers, where you’ll find a line like this:
X-Spam-Status: No, score=4.9 required=5.0 tests=ADDRESS_IN_SUBJECT,BAYES_50, NO_REAL_NAME autolearn=no version=3.1.7
The three tests listed in the header are the ones this particular message failed, meaning the rules it matched. It included an e-mail address in the subject line, the Bayes filter gave it roughly a 50% probability of being spam, and the e-mail address it was sent from did not include a name to go with the address. Together these gave it a score of 4.9. To be marked as spam, it needs a score of 5.0. In this case, the message was not spam, so SpamAssassin was correct.
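If you would rather pull those fields out with a script than read them by eye, something like the following sketch might do. It assumes the whole `X-Spam-Status` header has already been joined onto a single line, and the helper names `spam_status_field` and `spam_margin` are made up for this example.

```shell
#!/bin/sh
# Hypothetical helpers for reading an X-Spam-Status header held in one string.

# spam_status_field HEADER KEY -> prints the value of KEY (score, required, tests, ...)
spam_status_field() {
    printf '%s\n' "$1" |
        sed 's/,[[:space:]]*/,/g' |                 # drop whitespace after commas
        sed -n "s/.*[ ,]$2=\([^ ]*\).*/\1/p"        # capture up to the next space
}

# spam_margin HEADER -> prints how many points short of the threshold the message was
spam_margin() {
    score=$(spam_status_field "$1" score)
    required=$(spam_status_field "$1" required)
    awk -v s="$score" -v r="$required" 'BEGIN { printf "%.1f\n", r - s }'
}

# Example:
#   hdr='X-Spam-Status: No, score=4.9 required=5.0 tests=ADDRESS_IN_SUBJECT,BAYES_50, NO_REAL_NAME autolearn=no version=3.1.7'
#   spam_status_field "$hdr" tests
#   spam_margin "$hdr"
```

A small margin like the one in the header above is a hint that even a modest score bump on one of the matched tests would have tipped the message into the spam folder.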
The scoring system on a default SpamAssassin installation is very conservative. It’s a lot harder for e-mails to be marked as spam by default, because SpamAssassin wants to play it safe and make sure it doesn’t trash any of your real e-mails. This is a good goal, but if you’ve been training it for a few weeks then it’s much safer to bump the scores for some of the bad tests higher.
Don’t change these scores until you’ve looked at a good sample of your own mail to see how SpamAssassin is scoring it. If you subscribe to some newsletters you like, make sure they’re not being marked as spammy, or you’ll need to take some additional steps to protect them. If the spam you receive isn’t failing any SpamAssassin tests to begin with, it won’t matter how high you raise their scores.
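One way to survey a sample of mail is to tally which tests show up most often in a folder. The sketch below assumes an mbox file whose messages carry an `X-Spam-Status` header with a `tests=` list; `unfold_headers` and `tally_spam_tests` are hypothetical helper names, and the unfolding step simply joins any line that starts with whitespace onto the previous one.

```shell
#!/bin/sh
# Hypothetical helpers for counting SpamAssassin test names across an mbox file.

# unfold_headers FILE -> prints FILE with continuation lines joined to the previous line
unfold_headers() {
    awk '
        /^[ \t]/ { sub(/^[ \t]+/, " "); line = line $0; next }
        { if (NR > 1) print line; line = $0 }
        END { if (NR > 0) print line }
    ' "$1"
}

# tally_spam_tests FILE -> prints "COUNT TEST" lines, most frequent first
tally_spam_tests() {
    unfold_headers "$1" |
        grep '^X-Spam-Status:' |
        sed 's/,[[:space:]]*/,/g' |
        sed -n 's/.* tests=\([^ ]*\).*/\1/p' |
        tr ',' '\n' |
        sort | uniq -c | sort -rn |
        awk '{ print $1, $2 }'
}

# Example:
#   tally_spam_tests ~/mail/seth/Junk | head
```

Running it on the folder of spam that slipped through shows which tests are worth a higher score; running it on the inbox shows which of those tests your legitimate mail also trips.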
To change the scores of spam tests, edit the file ~/.spamassassin/user_prefs (~/ is your home directory). I looked at some of the spam that was getting through and saw three tests most often:
- FORGED_RCVD_HELO
- HTML_SHORT_LINK_IMG_1
- BAYES_90
(A giant list of all SpamAssassin’s tests is on the website.) These 3 tests would be a good place to start fighting spam, by raising their point value. I added these lines to the bottom of my user_prefs file:
score BAYES_90 5.0
score FORGED_RCVD_HELO 2.0
score HTML_SHORT_LINK_IMG_1 3.5
Since I’ve been training SpamAssassin, if Bayes says there’s a 90% chance of the message being spam, that’s good enough for me. I’ll give it the whole 5 points it needs to be marked as spam. Forged HELOs are still common from places like my university’s mailing system, so I can’t afford to give it a very high score. And messages that are nothing but a linked image are pretty shady, so they get a high score.
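As a sanity check on overrides like these, it can help to list which rules are now strong enough to mark a message as spam all by themselves. The following sketch reads `score NAME VALUE` lines from a prefs file; the helper name `rules_reaching_threshold` is hypothetical, only the first value on each line is considered, and the default threshold of 5.0 is taken from the header example earlier.

```shell
#!/bin/sh
# Hypothetical helper: list rules whose score alone reaches the spam threshold.

# rules_reaching_threshold PREFS_FILE [THRESHOLD] -> prints one rule name per line
rules_reaching_threshold() {
    awk -v r="${2:-5.0}" '
        $1 == "score" && ($3 + 0) >= (r + 0) { print $2 }
    ' "$1"
}

# Example:
#   rules_reaching_threshold ~/.spamassassin/user_prefs
```

Any rule it prints deserves extra scrutiny, since a single false match on that rule is enough to send real mail to the spam folder.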
3) Quality Assurance
For the first few weeks, take some time to review the spam folder that SpamAssassin dumps all your spam into, to make sure that one of your rules didn’t mistakenly cast its net too far. If you see some of your mail dropping into spam, just move it back to the inbox and train SpamAssassin again. You can also adjust your rules to make sure you don’t weight something too heavily. The spam folder gets pretty large, so you’ll want to clean it out every once in a while. You can just delete it and SpamAssassin will recreate it the next time you get spammed. I just added a line to my training script above that cleans out /mail/spam while it trains on the other folders.
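The cleanup step could look something like the sketch below. This is not the exact line from the training script, just one plausible shape for it: a hypothetical `clean_spam_folder` helper that assumes the spam folder is a single mbox file, reports how many messages it held, and then deletes it.

```shell
#!/bin/sh
# Hypothetical helper: delete an mbox spam folder after reporting its size.

# clean_spam_folder FILE -> removes FILE if it exists and prints what was done
clean_spam_folder() {
    if [ -f "$1" ]; then
        # grep -c exits non-zero when the count is 0, so ignore its status
        count=$(grep -c '^From ' "$1" || true)
        rm -f -- "$1"
        echo "Removed $count messages from $1"
    else
        echo "Nothing to clean: $1 not found"
    fi
}

# Example, using the folder path mentioned above:
#   clean_spam_folder /mail/spam
```

Review the folder for false positives before running it, since anything deleted here is gone for good.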
Following these steps should net you an easy 90% reduction in the amount of spam you get, and over time it will get even better, thanks to Bayesian filtering. Plus, not having to download the spam at all will make your mail reading in Thunderbird or other e-mail clients even faster and more enjoyable.
Thanks, Seth. This helps a lot.