
How does an email filter detect spam? Email filters detect spam by looking at the sender’s information, certain words in the email content, the structure of the email, and user engagement. Thanks to machine learning, they are much more accurate than they used to be.
Spam filters have been around since the mid-1990s, which was almost twenty years after the first commercial spam email and a few years after these unsolicited emails became known as spam. The name spam comes from a Monty Python sketch where two of the actors are in a restaurant, and everything on the menu comes with spam. The sketch ends with a number of Vikings and all of the actors singing a song whose only lyric is “spam” repeated a lot. Unsolicited emails were becoming like this, so the name spam seemed to fit.
The first spam email was sent in 1971 by a system administrator at MIT. The very first email was also sent that year, which only goes to show that spam has always existed. The administrator sent it to everyone in the group. However, this was merely a message and nothing like the spam we have all become used to. The first commercial spam email went out in 1978 and was an advert for Digital Equipment Corporation computers sent to about 400 people. Since then, spam has grown almost exponentially. It is estimated that around 100 billion spam emails are sent every single day. That is a lot of work for a spam filter. So, how do they do it?
One of the biggest signs that an email is spam is where it has been sent from. Companies that run spam filters have enormous databases of known spam email addresses and IP addresses. Every time you flag an email as spam, that email address and IP address get added to the database for future reference. These databases are known as reputation systems, and any email from a listed address automatically heads to your spam folder. One problem with this is that many spam emails are sent out from botnets, where a hacker takes over thousands of home computers with a virus and then uses them to send spam. If this happens, an innocent person’s IP address can be added to the database. That is why clicking “this is not spam” helps. Many filters also check whether the sender passes authentication standards like SPF, DKIM, and DMARC. These don’t guarantee an email is safe, but failing them is a strong warning sign, and passing them helps legitimate senders build a positive reputation over time.
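To make the idea concrete, here is a minimal sketch of a reputation check in Python. Everything in it is an assumption for illustration: the tiny in-memory blocklists, the names sender_score and report_not_spam, and the weights (5 for a listed sender or IP, 1 for each failed authentication check). A real filter queries a huge shared database and does the SPF, DKIM, and DMARC checks itself; here their results are simply passed in.

```python
from dataclasses import dataclass
from ipaddress import ip_address

# Hypothetical reputation data; a real filter would query a large shared database.
BLOCKED_SENDERS = {"deals@spam.example"}
BLOCKED_IPS = {ip_address("203.0.113.7")}


@dataclass
class AuthResults:
    """Outcome of the three authentication checks for one message."""
    spf: bool
    dkim: bool
    dmarc: bool


def sender_score(sender: str, source_ip: str, auth: AuthResults) -> int:
    """Return a rough risk score: higher means more suspicious."""
    score = 0
    if sender.lower() in BLOCKED_SENDERS:
        score += 5  # assumed weight for a listed address
    if ip_address(source_ip) in BLOCKED_IPS:
        score += 5  # assumed weight for a listed IP
    # Each failed check adds a warning sign; passing does not prove safety.
    score += sum(1 for passed in (auth.spf, auth.dkim, auth.dmarc) if not passed)
    return score


def report_not_spam(sender: str, source_ip: str) -> None:
    """Mimic the 'this is not spam' button by clearing reputation entries."""
    BLOCKED_SENDERS.discard(sender.lower())
    BLOCKED_IPS.discard(ip_address(source_ip))
```

In this sketch, a message from a listed address and IP that also fails all three checks scores 13, while the same sender scores 0 after report_not_spam has been called and the checks pass. That is the "innocent person" case from the paragraph above.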
Spam filters also scan the content of the email, looking for certain trigger words. These could be words like “winner”, “free”, “urgent”, or “Viagra”. They also scan any images and analyze the punctuation and the grammar. Excessive use of punctuation, such as exclamation marks, and bad grammar can be warning signs. The filters look at the email structure and scan any links inside it as well. As with the IP address, just because an email has trigger words, excessive punctuation, lots of links, and bad grammar doesn’t necessarily mean it is spam, but it is a good indicator.
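A rough sketch of this kind of content scan might look like the following. The function name content_signals and the choice of three signals are assumptions made for the example, and the word list is just the four examples above. Real filters look at far more, including images and grammar, which this sketch does not attempt.

```python
import re

# Illustrative trigger words taken from the examples in the text.
TRIGGER_WORDS = {"winner", "free", "urgent", "viagra"}


def content_signals(body: str) -> dict:
    """Count a few simple warning signs in an email body."""
    # Lowercase the text and split it into plain words before matching.
    words = re.findall(r"[a-z']+", body.lower())
    return {
        "trigger_words": sum(1 for w in words if w in TRIGGER_WORDS),
        "exclamation_marks": body.count("!"),
        "links": len(re.findall(r"https?://\S+", body)),
    }
```

The function only counts things. Deciding how much each count should matter is left to the scoring step, because none of these signals proves on its own that an email is spam.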
Email filters also monitor user engagement across millions of users. If most users who get emails from a certain address, or emails containing a certain set of trigger words, don’t open them, open them and delete them very quickly, or, obviously, flag them as spam, then the filter will remember this.
And this brings us to probably the most important technique that modern spam filters use. They can learn, and they constantly adapt and improve. They have a massive amount of experience with emails, and that increases every day.
However, all any spam filter can do, really, is guess. To do this, a lot of them use Bayes’ theorem, which is named after the 18th-century statistician Thomas Bayes. This theorem is a way of working out the probability of something given what is already known about the conditions related to it. It can be used for predicting who is likely to get diseases, and for working out the probability that someone who tests positive for cancer actually has it. In a spam filter, a weight is given to each of the categories we have mentioned, and more besides. Combining all of these weights gives an overall probability that the email is spam.
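Here is a small sketch of how that calculation can work for words. The function names and all of the probabilities are made up for illustration. The second function combines several words by assuming they appear independently of each other, which is the simplification usually called naive Bayes. Real filters learn these numbers from huge amounts of mail and combine many more signals.

```python
import math
from typing import Iterable, Tuple


def spam_given_word(p_word_spam: float, p_word_ham: float, p_spam: float) -> float:
    """Bayes' theorem: P(spam | word) from the two likelihoods and the prior."""
    numerator = p_word_spam * p_spam
    evidence = numerator + p_word_ham * (1.0 - p_spam)
    return numerator / evidence


def spam_given_words(likelihoods: Iterable[Tuple[float, float]], p_spam: float) -> float:
    """Combine several words, assuming they occur independently (naive Bayes).

    Each pair is (P(word | spam), P(word | not spam)).
    """
    # Work in log-odds so that many small probabilities do not underflow.
    log_odds = math.log(p_spam / (1.0 - p_spam))
    for p_word_spam, p_word_ham in likelihoods:
        log_odds += math.log(p_word_spam / p_word_ham)
    # Convert the log-odds back into a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))
```

Suppose, purely as an example, that “free” appears in 30% of spam and 2% of normal email, and that half of all email is spam. Then spam_given_word(0.30, 0.02, 0.5) gives 0.9375, so an email containing “free” would be about 94% likely to be spam on that evidence alone. Adding a word that is more common in normal email pulls the probability back down.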
Spam filters are very good these days, and the majority of us don’t even open spam. So, you might think, why do people still send spam? The answer lies in the cost and the volume. Sending millions of spam emails costs almost nothing and takes almost no time because it is automated. Spammers get addresses by either scraping the internet or buying mailing lists. They don’t need many people to click on their emails to make money. One study ran an experiment to see how many people respond to spam. The researchers used bots to send out 350 million emails about a pharmacy campaign. Of those emails, about 267 million were blocked, and 82,700,000 arrived. From those, 10,522 people visited the site. And from those, 28 people made a purchase. That is a conversion rate of about 0.0000081% of the original 350 million. However, if each of those 28 people spends $100 and your spam campaign costs you $10, it is worth it. And this is what I learned today.
Sources
https://abnormal.ai/glossary/email-filters
https://www.sagenetcom.com/blog/how-email-spam-filter-works
https://en.wikipedia.org/wiki/History_of_email
https://www.hornetsecurity.com/en/knowledge-base/bayesian-filter
https://en.wikipedia.org/wiki/History_of_email_spam
https://en.wikipedia.org/wiki/Bayes%27_theorem
https://en.wikipedia.org/wiki/Email_spam
https://cseweb.ucsd.edu/~savage/papers/CACMSpam09.pdf
Photo by Torsten Dettlaff: https://www.pexels.com/photo/black-and-gray-digital-device-193003/
