Deliberate attacks are a distinct category of security and privacy risks to an AI system. They may be carried out by adversaries ranging from private individuals and criminal organizations to state-level actors, and their motivations vary. Some attackers are driven by the challenge of compromising a system or exploiting its vulnerabilities, and perhaps by the recognition this brings within the hacking community. Others seek financial gain through activities such as stealing sensitive information, conducting ransomware attacks, or committing identity theft. Political motivations can also drive attackers to target specific networks to disrupt critical infrastructure or influence public opinion.
AI systems can be targets of most of the same cyberattacks carried out against any other software system. In addition, they have a set of unique vulnerabilities that attackers can exploit.
If an AI model were a guard dog
Imagine you want to protect the entrance to a room. You can do it with a lock and key, or with a trained guard dog. Picking or breaking the lock is similar to exploiting a traditional software system, while tricking the guard dog is similar to attacking an AI model.
You can open the lock with specialized lockpicking tools, or simply by brute force. With a dog, however, you can try other approaches. You can dress up as a scary monster to frighten the dog, or use a few steaks to retrain it to recognize you as its owner. You can even watch how the dog reacts to different people to identify its owner, and then force the owner to give you access. The point is that when dealing with a complex algorithm that learns from data, you can exploit that complexity to gain an advantage.
By the way, the dog examples above correspond roughly to the different types of attacks that can be used on AI systems: the scary monster is like model evasion, the steaks are data poisoning, and identifying the owner is membership inference. Understanding these attacks helps AI developers and deployers prepare against them, thereby contributing to AI resilience.
Security attacks
Security attacks include data poisoning and model evasion, which occur at two different stages of the AI pipeline. In general, data poisoning aims to corrupt the training phase, while model evasion manipulates inputs once the AI system is deployed and making predictions.
Data poisoning
A data poisoning attack begins when adversaries, also known as "threat actors", gain access to the training dataset and contaminate it by modifying existing entries or introducing tampered data. For example, adversaries could fool a model trained on a large collection of dog images by injecting a white box into each of those images. As a consequence, the model would learn to associate any image featuring a white box with a dog. Exploiting this vulnerability, an adversary could later present a cat image with a white box, leading the model to mistakenly classify it as a dog.
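To make the white-box example concrete, here is a minimal sketch of that kind of trigger injection, assuming the training images are NumPy arrays with pixel values in the range 0–255 and that target_class is the numeric label of the "dog" class (the names, shapes, and patch size are assumptions made for illustration):

```python
import numpy as np

def add_white_patch(image, size=8):
    # Stamp a white square into the top-left corner of the image
    # (image is an H x W x C array with pixel values in [0, 255]).
    poisoned = image.copy()
    poisoned[:size, :size] = 255
    return poisoned

def poison_class(images, labels, target_class):
    # Add the patch to every training image of the target class ("dog"),
    # so the trained model learns to associate the patch with that class.
    poisoned_images = images.copy()
    for i in np.where(labels == target_class)[0]:
        poisoned_images[i] = add_white_patch(poisoned_images[i])
    return poisoned_images
```

Any image stamped with the same patch at prediction time, including a cat photo, is then likely to be pulled towards the "dog" class.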
Another type of data poisoning attack is label flipping, an adversarial attack that manipulates the output of a machine learning model by changing the labels of some of the training examples. In a typical supervised learning scenario, a model is trained on a labeled dataset, where each data point is assigned a correct label. An attacker who knows the algorithm is learning from that data can introduce nonsensical or purposefully wrong examples, for instance by leaving one-star reviews with very positive wording on Amazon to confuse a sentiment analysis algorithm. This can cause the model to learn incorrect associations between data features and labels, resulting in incorrect predictions on new data. Misclassifications caused by data poisoning attacks can have serious implications in crucial fields such as healthcare and autonomous systems. Potential consequences include inaccurate medical diagnoses, which can endanger patient well-being, as well as misidentification of objects by autonomous vehicles, posing risks of accidents and harm.
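A minimal sketch of label flipping on a toy dataset might look like the following; the flipped fraction, the number of classes, and the integer label encoding are assumptions made for the example:

```python
import numpy as np

def flip_labels(labels, num_classes, fraction=0.1, seed=0):
    # Randomly pick a fraction of the training examples and reassign
    # each of them to a different (wrong) class label.
    rng = np.random.default_rng(seed)
    flipped = labels.copy()
    n_flip = int(fraction * len(labels))
    for i in rng.choice(len(labels), size=n_flip, replace=False):
        wrong_classes = [c for c in range(num_classes) if c != flipped[i]]
        flipped[i] = rng.choice(wrong_classes)
    return flipped
```

A model trained on the flipped labels learns partly wrong associations between features and classes, which is exactly what the one-star-review example above exploits.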
Model evasion
Model evasion attacks are among the most common types of attacks targeting the inference process of machine learning systems. They induce misclassifications using well-crafted malicious inputs, so-called adversarial examples, which confuse machine learning models into making an incorrect prediction. As shown in a seminal work by Goodfellow, Shlens, and Szegedy (2), adding a minimal amount of noise to the input, imperceptible to a human, was enough to change the classification result from a “panda” to a “gibbon” with surprisingly high confidence (99.3%). Confusing a panda with a gibbon might not yet be a big worry, but consider a situation where an attacker manipulates traffic signs, especially stop signs and speed limit signs: the consequences of such attacks can be devastating. These attack patterns don't even require added noise; small changes to a stop sign may be enough for the network to recognize it as a "50 km/h" speed limit sign (3). The example above highlights the vulnerability of AI models to security attacks, as even minor modifications to the input data can lead to significant errors in the model's predictions. The consequence of such attacks is the erosion of trust in AI systems and potential negative impacts on the many applications relying on these models.
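The panda example in (2) was created with the fast gradient sign method. The snippet below is a minimal sketch of that idea, assuming a PyTorch image classifier model, a suitable loss function loss_fn, input images x scaled to the range [0, 1], true labels y, and a small step size epsilon (all of these names are illustrative):

```python
import torch

def fgsm_attack(model, loss_fn, x, y, epsilon=0.01):
    # Fast gradient sign method: nudge every pixel a tiny step in the
    # direction that increases the model's loss for the true label y.
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()
    # Keep pixel values in a valid range; to a human the image looks unchanged.
    return x_adv.clamp(0.0, 1.0).detach()
```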
Privacy attacks
Privacy attacks commonly take the form of a ‘data inference attack’, which analyzes the outputs of machine learning models to obtain information about the data used to train them. Two main types of inference attack exist: ‘membership inference’ and ‘attribute inference’.
Inference — a word with two meanings in AI
The Oxford Dictionary defines inference as “something that you can find out indirectly from what you already know”. In the machine learning context it’s used to describe two related but distinct things:
1. The prediction, classification, decision, or other output generated by a machine learning model once it’s deployed.
2. A privacy attack seeking to learn something about a machine learning model’s training data based on its output (and other data).
Attribute inference
Attribute inference attacks assume a situation where an attacker already has partial knowledge about a data record (for example, public data) and tries to infer the values of its missing attributes (for example, private data). The attacker can repeatedly query the target model with different possible values of a missing attribute and analyze the outputs to find out which value actually appears in the corresponding record of the training data set.
For example, let's consider an AI-driven application that offers personalized services, such as targeted advertisements, to users based on their publicly available data like age, gender, city, and country, as well as their private personal data, including their home address. In this scenario, an attacker could gather the public data and submit multiple queries with varying values for the private data. By analyzing the responses received from these queries, the attacker can deduce the actual values of the private data, thereby compromising the privacy of the users.
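A rough sketch of how such an attack could be automated is shown below; the query_model API, the home_address field, and the use of the model's confidence as the signal are all illustrative assumptions rather than a description of any particular system:

```python
import numpy as np

def infer_missing_attribute(query_model, public_record, candidate_values):
    # query_model: assumed black-box prediction API that takes a complete
    # record and returns the confidence of the model's output.
    # public_record: the attributes the attacker already knows.
    # candidate_values: guesses for the missing (private) attribute.
    confidences = []
    for value in candidate_values:
        record = {**public_record, "home_address": value}  # hypothetical field
        confidences.append(query_model(record))
    # The guess that makes the model behave most confidently is taken as the
    # value most likely present in the corresponding training record.
    return candidate_values[int(np.argmax(confidences))]
```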
Membership inference
Membership inference attacks assume a situation where a record (for example, an entry about a person) and access to a machine learning model (for example, a black-box model API) are given. The attack tries to identify whether the record was included in the training data used to develop the model. The basic idea is to learn how the target system behaves differently on inputs it has already seen in the training data set compared to inputs it has never seen.
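One simple (and deliberately simplified) version of this idea is a confidence-threshold test, sketched below; the query_model API, the threshold value, and the use of the model's confidence as the signal are assumptions made for the example:

```python
def is_likely_member(query_model, record, true_label, threshold=0.9):
    # query_model: assumed black-box API returning a list of class probabilities.
    # Overfitted models tend to be noticeably more confident on records they
    # were trained on, so a very high confidence for the true label is taken
    # as a hint that the record was part of the training data.
    probabilities = query_model(record)
    return probabilities[true_label] > threshold
```

More elaborate attacks described in the research literature train so-called shadow models to learn this behavioral difference automatically, but the underlying signal is the same.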
Researchers have shown that this attack type is theoretically possible, but there’s limited evidence of it happening in real life so far. The risk of a membership inference attack is higher for overfitted models (following detailed variations in training data too closely) and when unlimited access to model predictions is allowed.