Like so many other industries, the IT industry has seen its share of fads over the years. One of the most prevalent fads of the moment is AI and machine learning. The last year has seen machine learning capabilities integrated into everything from anti-malware software to high-performance storage arrays. Now please don’t misunderstand me. I’m not trying to suggest that this sudden, ubiquitous inclusion of machine learning in all manner of IT products is a bad thing. Machine learning certainly has its place, and while the use of machine learning may be overkill for some products, machine learning has allowed many IT products to be vastly improved. Even so, one has to ask whether reliance on machine learning could become something of an Achilles’ heel. In particular, can machine learning poisoning cause a product to start making bad decisions?
What is machine learning poisoning?
Machine learning and AI systems are really nothing more than algorithms that learn to make decisions based on the data that they ingest. Machine learning poisoning is a way of intentionally feeding a machine learning algorithm bad or misleading data in an effort to get it to make poor decisions.
In some ways, machine learning poisoning is a lot like taking an online survey and answering the questions in a really sarcastic way, or outright lying. Such behavior can be a source of amusement for the survey taker, but if enough people were to engage in this sort of behavior then it could completely invalidate the survey results. Whoever commissioned the survey might even end up making some bad business decisions if they believe that the survey results are credible.
As I ponder the similarities between machine learning poisoning and manipulating survey data, I can’t help but be reminded of an old Calvin and Hobbes comic strip from 1995. Here is the dialog:
Calvin: “I’m filling out a reader survey for a chewing [gum] magazine…. See, they asked how much money I spend on gum each week, so I wrote ‘$500.’ For my age, I put ’43,’ and when they asked what my favorite flavor is, I wrote ‘garlic / curry.’ ”
Hobbes: “This magazine should have some amusing ads soon.”
Calvin: “I love messing with data.”
The Calvin and Hobbes comic strip from so long ago was, of course, intended to be light and funny. Even so, it underscores the idea that business decisions are often based on data. Simply put, bad data can lead to bad decisions. It doesn’t matter if those decisions are being made by a human or by a machine.
Consequences of machine learning poisoning
Okay, so now that I have spent some time talking about what machine learning poisoning is, I want to talk about the potential consequences of a machine learning poisoning attack. Unfortunately, it is impossible to discuss the potential consequences in a general sense, because machine learning is used for such a huge variety of purposes. After all, an attack against the machine learning algorithm used in a data backup application would have very different consequences from an attack against the machine learning algorithm that is being used by a driverless car. In one instance, the attack could lead to data loss. In the other instance, the attack could lead to loss of human life.
What is interesting about machine learning poisoning, however, is that there can be varying degrees of poisoning. To illustrate what I mean, let me give you an example from the human world.
Imagine for a moment that you were able to calculate the exact amount of cyanide required to kill a particular person. If that person were to ingest the cyanide, it would of course be lethal. If, on the other hand, the same person were to ingest 1 percent of the calculated minimum lethal dose, then the effects would not be lethal. The person who took the cyanide might become ill, and might even suffer lasting harm, but they would not die because they did not consume a sufficient quantity of the poison.
This same basic concept also applies to machine learning poisoning. Algorithms can vary widely, but generally speaking a small amount of bad data probably isn’t going to cause the machine learning algorithm to consistently make bad decisions. In fact, there are algorithms that are designed to ignore outliers. If a little bit of bad data were to make it into such an algorithm, then that data would probably be dismissed as an irrelevant outlier. On the other hand, a machine learning algorithm that is fed a steady diet of bad data is going to start behaving in an unintended way. After all, machine learning algorithms are designed under the assumption that the vast majority of the data that they receive is trustworthy, and those algorithms make decisions based on that data.
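The dose idea can be illustrated with a small Python sketch. It is only a toy under stated assumptions: the readings, the bad value of 1000, and the `poison` helper are all made up for illustration, and the median stands in for "an algorithm designed to ignore outliers." It is not how any particular product works.

```python
from statistics import mean, median

def poison(clean, bad_value, fraction):
    """Return a copy of `clean` with the given fraction of entries replaced by bad_value."""
    n_bad = round(len(clean) * fraction)
    return [bad_value] * n_bad + clean[n_bad:]

# Hypothetical "trusted" readings: 100 values cycling through 9, 10, 11.
clean = [9 + (i % 3) for i in range(100)]

# Compare an outlier-sensitive summary (mean) with an outlier-resistant one (median)
# at no poisoning, a small dose (1 percent), and a steady diet (60 percent).
for fraction in (0.0, 0.01, 0.6):
    data = poison(clean, bad_value=1000, fraction=fraction)
    print(f"{fraction:>4.0%} poisoned: mean={mean(data):8.2f}  median={median(data):8.2f}")
```

With a small dose, the median stays where it was while the mean is already dragged upward. Once most of the data is bad, even the outlier-resistant summary follows the poisoned values.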
Can machine learning poisoning actually happen?
Of course, the real question is whether machine learning poisoning can actually happen. Let me begin by saying that I don’t expect to ever see the emergence of general-purpose malware that is designed to poison any machine learning algorithm that it happens to encounter. Machine learning systems are simply too different from one another for that to happen. Someone could conceivably create malware that is designed to attack the machine learning engine within a particular application, but a general-purpose attack against any and all machine learning applications seems very unlikely.
But let’s suppose that someone did create a piece of malware that was designed to poison a specific application’s machine learning engine. How might such an attack work?
There are two basic processes used by machine learning. The first is the training process. This is where data is used to train the machine learning engine to recognize a particular condition. For example, right now I am using a speech recognition software application that has been trained to recognize my speech patterns.
While it may be possible to attack a machine learning engine during the training phase, such an attack would require an immense amount of data, and it would probably also be time-consuming. Such an attack would probably be recognized and thwarted prior to completion.
The other high-level process used by machine learning is called inference. Inference refers to the machine learning engine making a decision based on the training that it has already received. The easiest way to perform a machine learning poisoning attack would probably be to overwrite the existing training data with poisoned data, thereby causing a breakdown of the inference process. This approach would free the attacker from having to completely retrain a machine learning engine. Instead, the engine’s existing training data is simply overwritten with something else.
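To make the two processes concrete, here is a minimal Python sketch. It assumes a deliberately tiny nearest-centroid model; the `train` and `infer` functions, the scores, and the labels are all hypothetical. It also assumes that what gets overwritten is the stored result of training, which is one way to read "existing training data."

```python
def train(samples):
    """Training: average the feature value for each label (a nearest-centroid model)."""
    sums, counts = {}, {}
    for value, label in samples:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def infer(model, value):
    """Inference: return the label whose stored average is closest to the value."""
    return min(model, key=lambda label: abs(model[label] - value))

# Hypothetical training data: (suspicion score, label).
samples = [(0.1, "benign"), (0.2, "benign"), (0.8, "malicious"), (0.9, "malicious")]
model = train(samples)
print(infer(model, 0.85))      # decision based on the legitimate training

# An attacker with write access swaps the stored values. No retraining is needed.
tampered = {"benign": model["malicious"], "malicious": model["benign"]}
print(infer(tampered, 0.85))   # same input, opposite decision
```

The inference code never changes in this sketch. Only the data it relies on does, which is why write access to that data matters so much.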
The takeaway: Watch your permissions
I have not yet heard of any large-scale machine learning poisoning attacks occurring in the wild. Even so, I think that it would be relatively easy for someone to develop malware that would target a specific, machine learning-enabled application, and then replace that application’s training data with data that has been engineered to make the application behave in an undesirable way. As such, it is going to be increasingly important for organizations to carefully manage the permissions that are given to applications and application data.
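As a rough illustration of what watching permissions might look like in practice, the following Python sketch checks whether a data file can be written by anyone other than its owner, and uses a hash to notice an overwrite. It assumes a POSIX-style system; the temporary file is a hypothetical stand-in for an application's training data, and the helper names are my own. The hash comparison is an added assumption that complements permission management rather than replacing it.

```python
import hashlib
import os
import stat
import tempfile

def is_writable_by_others(path):
    """True if the group or world write bit is set on the file (POSIX permission bits)."""
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IWGRP | stat.S_IWOTH))

def sha256_of(path):
    """Hex SHA-256 digest of the file contents, for comparison with a known-good value."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()

# Hypothetical stand-in for an application's training data file.
with tempfile.NamedTemporaryFile(delete=False, suffix=".dat") as handle:
    handle.write(b"trusted training data")
    data_path = handle.name

os.chmod(data_path, 0o666)                 # overly permissive: any account may overwrite
loose = is_writable_by_others(data_path)
os.chmod(data_path, 0o600)                 # owner-only read and write
tight = is_writable_by_others(data_path)
known_good = sha256_of(data_path)          # record a digest while the data is trusted

with open(data_path, "wb") as handle:      # simulate an overwrite of the data
    handle.write(b"poisoned training data")
changed = sha256_of(data_path) != known_good

os.remove(data_path)
print(loose, tight, changed)
```

Tight permissions reduce who can overwrite the data in the first place, and a recorded digest gives the application a way to notice if it happens anyway.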