That’s the period for which I have statistics on the spam and non-spam email that arrived at my mail server. Now that I’ve moved to a new cloud-based Exchange server, it seems only appropriate to see how much garbage arrived between February 19, 2008 and July 20, 2013 (I was running my own email server for years before then, but I had to rebuild my system and lost the earlier records).
Just over 1 million emails hit the server during that time, roughly seven-eighths of which were spam:
[singlepic id=3 w=800]
Because most spam tends to be relatively small, almost 40% of the 19.5 gigabytes of traffic was non-spam (ham):
[singlepic id=2 w=800]
It’s annoying to think that I was paying connectivity charges for all that garbage!
What kept us from drowning in all that junk was an open source program — actually, an ecosystem — called SpamAssassin. In addition to having crowd-sourced rules for detecting spam, it also has a Bayesian analyzer that you train to identify “spamminess” based on how you categorize the emails you receive (training takes real effort when you first adopt SpamAssassin, but the workload drops off quickly within a couple of weeks). Here’s a breakdown of the types of SpamAssassin rules that got triggered by our email:
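The idea behind that Bayesian analyzer can be sketched in a few lines. This is a toy naive-Bayes word model, not SpamAssassin’s actual code: you feed it messages you’ve marked as spam or ham, and it learns which words lean which way. All names here are my own illustration.

```python
from collections import Counter
import math

class BayesFilter:
    """Toy Bayesian spam filter in the spirit of SpamAssassin's analyzer
    (a simplified illustration, not the real implementation)."""

    def __init__(self):
        self.spam_words = Counter()  # how many spam messages contained each word
        self.ham_words = Counter()   # how many ham messages contained each word
        self.spam_msgs = 0
        self.ham_msgs = 0

    def train(self, text, is_spam):
        """Record which words appeared in a message you categorized."""
        words = set(text.lower().split())
        if is_spam:
            self.spam_words.update(words)
            self.spam_msgs += 1
        else:
            self.ham_words.update(words)
            self.ham_msgs += 1

    def spamminess(self, text):
        """Score a message in (0, 1): above 0.5 leans spam, below leans ham."""
        # Start from the prior odds of spam vs. ham, then add the
        # log-odds contribution of each word (Laplace smoothing so
        # unseen words don't zero everything out).
        log_odds = math.log((self.spam_msgs + 1) / (self.ham_msgs + 1))
        for w in set(text.lower().split()):
            p_spam = (self.spam_words[w] + 1) / (self.spam_msgs + 2)
            p_ham = (self.ham_words[w] + 1) / (self.ham_msgs + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))
```

After a bit of training on your own mail, messages full of words you’ve flagged before score high, which is why the filter keeps improving as you correct it.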
[singlepic id=4 w=800]
“Reputation Lookup” refers to the various websites that keep lists of known spammers. “Format” refers to errors in the structure of the email itself (spammers, in addition to being vile human beings, are apparently also somewhat sloppy :)). “Source” means who sent the email, while “Content” means the actual text of the message. While the Bayesian analyzer also looks at content, it’s interesting that so little of the spam detection depends on what the spammer hopes we’ll read. It’s really much more about structure and patterns: how the email was built and how it was delivered.
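Those rule categories all feed one simple mechanism: each rule a message triggers adds a weighted score, and the message is flagged once the total crosses a threshold (5.0 by default in SpamAssassin). The rule names and weights below are made up for illustration; only the threshold and the summing scheme reflect how SpamAssassin actually works.

```python
# Hypothetical rules, one per category from the chart above.
# Names and weights are invented; real SpamAssassin rules differ.
RULES = {
    "RCVD_IN_BLOCKLIST": 3.2,  # reputation lookup: sending IP is on a blocklist
    "MISSING_DATE":      1.4,  # format: malformed or missing headers
    "FORGED_SENDER":     2.1,  # source: claimed sender doesn't match the delivery path
    "URGENT_SUBJECT":    0.8,  # content: "act now!" phrasing in the subject
}

SPAM_THRESHOLD = 5.0  # SpamAssassin's default required_score

def score(hits):
    """Sum the weights of the rules a message triggered and decide spam/ham."""
    total = sum(RULES[r] for r in hits)
    return total, total >= SPAM_THRESHOLD
```

This additive design is why no single category has to catch everything: a message that dodges the content rules can still be sunk by its bad reputation and sloppy formatting.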