Hardly a day goes by when I do not hear about some fantastic new way in which machine learning is being put to work. I recently read a fascinating article about how machine learning can help doctors spot MRI anomalies that might otherwise go unnoticed, and another about how machine learning is being applied to casino security. These examples illustrate the variety of ways that the technology can be applied. But what I really want to know is this: When will machine learning finally put an end to spam once and for all?
On the surface, the task of spam control seems ideally suited to machine learning. Machine learning can, of course, take many different forms, but the technology is usually based on finding patterns within large datasets. Given the volume of spam that gets sent each day, I think that it is safe to say that the spammers have given us a large dataset to analyze.
The funny thing about spam is that it is easy to recognize (if you exclude phishing emails that are meant to mimic legitimate messages), but tough to define in a meaningful way. The phrase “in a meaningful way” is key here. Spam is easy to define at a very high level – it’s unsolicited junk email. When it comes to controlling spam, however, that definition is useless, because it does not provide any machine-distinguishable criteria with which to define spam. For a human, the “unsolicited junk email” definition is perfectly adequate. Most people can look at an email message and determine almost immediately whether the message is spam. Computers, however, do not have the same frame of reference as a human, nor do they have the ability to think in the same way that a human does. Currently, computers can only distinguish between spam and other messages based on a series of rules.
The rules that current-generation spam filters use can be surprisingly effective. The spam problem has existed for long enough that spam filters have matured to the point of being reasonably reliable. I receive hundreds of spam messages every day, but out of all that spam, only two or three typically end up in my Inbox on a given day. That’s pretty good.
So is that the end of the story? Not quite. The big problem that spam-filtering vendors seem to be stuck on is that not all spam is created equal. Let me give you an example. Suppose for a moment that I opened my inbox right now and found two messages. Let’s pretend that one message was from the editor of TechGenix, and that she wanted to know when I planned to submit this article. Let’s also pretend that the other message was from an alleged Nigerian prince who wanted to send me $20 million. In this type of situation, one of the messages is completely legitimate, while the other is completely fraudulent, and obviously spam.
However, not all email messages can be so clearly defined. Think about the definition for spam that I used earlier: “unsolicited junk email.” The problem with this definition is that the term “junk” is completely subjective. What I consider to be junk mail, you may not. Let me give you a couple of examples.
One man’s spam...
One especially common example is email newsletters. As a tech journalist, I get bombarded with tech-related newsletters every single day. Of these newsletters, there are several that I make sure to read. They consistently provide me with useful information that I want to know. Conversely, there are also some newsletters that I consider to be spam, for various reasons. These messages might contain misleading subject lines, more ads than substance, malicious links, or any number of other factors that make me want to avoid wasting my time opening them. The point is that a spam filter cannot simply determine that all email newsletters are legitimate mail, nor can such a filter determine that all email newsletters are spam. A newsletter can easily be either. Furthermore, personal preference also weighs on the determination. I’m sure that some of the newsletters that I read religiously are considered by some people to be spam.
A second example of the subjectivity of junk email comes from email advertisements. I’m not talking about a message containing any editorial content whatsoever, but rather a message that is nothing but an advertisement. While such messages would largely be considered to be spam, it is ultimately personal preference that makes the determination. For example, there is a flight simulation company that emails me when they have released new add-ons that make the simulator more realistic. Even though these messages are nothing more than advertisements, I enjoy looking at them. Similarly, I know people who love getting coupons by email.
The point is that spam control isn’t as easy as it might seem. One person’s favorite newsletter is another person’s annoying spam. So what about using machine learning to control spam? Actually, machine learning has been used for spam control for many years, although it hasn’t always been called machine learning.
Many years ago, Microsoft began to realize that because its Hotmail users were receiving so much spam, the Hotmail mailbox servers were a virtual treasure trove of analytic data. Soon thereafter, Microsoft began a project that was designed to develop a formula for mathematically determining whether or not a message was spam. The resulting mathematical probability came to be known as a spam confidence level.
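To make the idea of a mathematical spam probability a little more concrete, here is a minimal sketch of one classic approach, a naive Bayes word-frequency classifier. To be clear, this is not Microsoft’s formula, which I have not seen. The training messages, the function names, and the 50 percent prior are all assumptions that I made up purely for illustration.

```python
import math
from collections import Counter

# Hypothetical, hand-written training data (not real Hotmail mail).
SPAM = [
    "win money now",
    "claim your free prize money",
    "free money transfer from prince",
]
HAM = [
    "when will you submit the article",
    "meeting notes for the article",
    "lunch tomorrow",
]


def train(messages):
    """Count how often each word appears in a list of messages."""
    counts = Counter()
    for message in messages:
        counts.update(message.lower().split())
    return counts


def spam_probability(text, spam_counts, ham_counts, prior_spam=0.5):
    """Estimate P(spam | text) with naive Bayes and add-one smoothing."""
    vocab = set(spam_counts) | set(ham_counts)
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())

    # Work in log space so that long messages do not underflow.
    log_spam = math.log(prior_spam)
    log_ham = math.log(1.0 - prior_spam)
    for word in text.lower().split():
        if word not in vocab:
            continue  # words never seen in training carry no evidence
        log_spam += math.log((spam_counts[word] + 1) / (spam_total + len(vocab)))
        log_ham += math.log((ham_counts[word] + 1) / (ham_total + len(vocab)))

    # Turn the two log scores back into a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


spam_counts = train(SPAM)
ham_counts = train(HAM)
prince_score = spam_probability("free money from a prince", spam_counts, ham_counts)
editor_score = spam_probability("submit the article", spam_counts, ham_counts)
```

With this toy data, the prince’s message scores far closer to 1 than the editor’s does. A real filter would train on millions of messages and would look at much more than individual words.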
The spam confidence level does a really good job of flagging the most obvious spam, but it doesn’t really take individual subjectivity into account. I think that’s probably why Microsoft has added a Clutter folder to Outlook. The most obvious spam goes into the Spam folder, things like newsletters usually end up in the Clutter folder, while regular email ends up in the Inbox. However, some spam-like mail, such as advertisements that I actually want to read, is placed directly into the Inbox.
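The sorting behavior that I just described can be pictured as a score with two cutoffs, plus a per-user exception list. The sketch below is only my guess at the general shape of such logic, not how Outlook actually works. The threshold values, the function name, and the sender address are all hypothetical.

```python
def route(score, sender, wanted_senders, spam_threshold=0.9, clutter_threshold=0.5):
    """Pick a folder from a spam score (0 to 1) and one user's preferences."""
    # Personal preference wins: mail this user has said they want
    # goes to the Inbox no matter how spam-like it looks.
    if sender in wanted_senders:
        return "Inbox"
    if score >= spam_threshold:
        return "Spam"
    if score >= clutter_threshold:
        return "Clutter"
    return "Inbox"


# Hypothetical sender that this particular user enjoys hearing from.
my_wanted = {"news@flightsim.example"}

ad_for_me = route(0.95, "news@flightsim.example", my_wanted)
ad_for_you = route(0.95, "news@flightsim.example", set())
newsletter = route(0.6, "digest@vendor.example", my_wanted)
```

The same advertisement lands in my Inbox but in someone else’s Spam folder, which is exactly the kind of subjectivity that a single global score cannot capture.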
Next-generation spam control
As I said earlier, current spam filtering technology has matured to the point that it seems to work pretty well. Even so, Microsoft’s spam confidence level rating stems from a message classification project that began many years ago. Since that time, sample sizes have grown larger, computing power has vastly improved, and machine learning technology has become far more mature. I suspect that modern machine learning could eventually put a complete stop to spam, if the algorithms are allowed to monitor individual user behaviors.
Of course, a modern anti-spam solution would need to look at more than just the message sender, point of origination, text, and other attributes that are so often examined in an effort to classify email messages. Such a solution would need to be able to perform deep-content scans. For example, optical character recognition might be used to detect image spam. Similarly, an analysis of the way that a user interacts with coworkers might lead to a good solution to meeting spam (meeting spam is spam that is embedded into a meeting request for the purpose of displaying spam on the user’s calendar). That, however, opens the door to privacy concerns and the potential for data to be misused. Even so, I am confident that the war against spam will eventually be won.
Spam evolves, and so should anti-spam technology
The point is that a next-generation spam control solution will need to look beyond the simple message attributes that anti-spam software has relied upon for so long. Spam evolves, and so too must any anti-spam solution. This is where machine learning could really be helpful. A machine learning algorithm could conceivably detect the evolution of spam as it happens, and dynamically adapt its approach to spam filtering in response to those changes.
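As a rough illustration of what adapting as spam evolves might look like, here is a small sketch of a filter that updates its word statistics every time a user marks a message as spam or not spam. The class name, the labels, and the sample messages are all invented for this example, and a production system would be far more sophisticated.

```python
import math
from collections import Counter


class AdaptiveFilter:
    """Toy filter that keeps learning from user feedback (hypothetical design)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}

    def learn(self, text, label):
        """Fold one labeled message ('spam' or 'ham') into the word counts."""
        self.counts[label].update(text.lower().split())

    def score(self, text):
        """Return a spam probability between 0 and 1 for the given text."""
        spam, ham = self.counts["spam"], self.counts["ham"]
        vocab = set(spam) | set(ham)
        if not vocab:
            return 0.5  # nothing learned yet, so no opinion either way
        spam_total = sum(spam.values())
        ham_total = sum(ham.values())
        log_odds = 0.0
        for word in text.lower().split():
            if word in vocab:
                # Add-one smoothing keeps unseen words from zeroing out a class.
                p_spam = (spam[word] + 1) / (spam_total + len(vocab))
                p_ham = (ham[word] + 1) / (ham_total + len(vocab))
                log_odds += math.log(p_spam / p_ham)
        return 1.0 / (1.0 + math.exp(-log_odds))


spam_filter = AdaptiveFilter()
spam_filter.learn("article deadline reminder", "ham")

# A new style of spam appears that the filter has never seen before.
before = spam_filter.score("crypto giveaway")

# One user reports it, and the filter adjusts immediately.
spam_filter.learn("crypto giveaway today", "spam")
after = spam_filter.score("crypto giveaway")
```

Before the report, the filter has no opinion about the new message. After a single report, its score for similar messages rises, with no one having to write a new rule by hand.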