I.

Attacks against machine learning systems

In computer science, an attack is defined as “an operation, whether in offense or defense, intended to alter, delete, corrupt, or deny access to a computer data or software for the purposes of propaganda or deception; and/or partly or totally disrupting the functions of the target computer, computer system or network, and related computer-operated physical infrastructure if any; and/or producing physical damage extrinsic to the computer, computer system or network” (1).

An attacker may attempt to exploit a vulnerability of a system to carry out an actual attack. The motivation of an attacker may arise from various factors, including technical, economic, or political considerations. From a technical perspective, an attacker may be driven by the challenge of successfully compromising a system or exploiting vulnerabilities, and perhaps by gaining recognition within the hacking community. On the economic side, attackers may seek financial gain through activities such as stealing sensitive information, conducting ransomware attacks, or engaging in identity theft for monetary benefit. Finally, political motivations can drive attackers to target specific networks to disrupt critical infrastructure or influence public opinion. To achieve their objectives, attackers may mount campaigns that are coordinated, organized, launched at enormous scale, and scrupulously designed, demanding considerable time and resources.

This subsection introduces some typical security and privacy attacks towards machine learning (ML) systems.

Security attacks

One way to harm a machine learning system is to distort the proper functioning of the machine learning model by feeding malformed data, or adversarial examples, into the model, targeting either the inference step or the training step of the model lifecycle. Designing and building these types of attacks is often referred to as adversarial machine learning. Let’s look into two examples of this.

Evasion attacks

Model evasion attacks are the most common attacks on machine learning systems. Evasion attacks happen at the inference phase of the machine learning model lifecycle – in other words, when the model is used. Model evasion attacks compromise the integrity of the machine learning model’s predictions. They use well-crafted malicious inputs, so-called ‘adversarial examples’, to confuse machine learning models into making incorrect predictions. Evasion attacks typically aim to cause a misclassification while making only minimal modifications to the sample to be misclassified. For example, an attacker can implement an evasion attack to bypass a network intrusion detection system (NIDS) by minimally modifying malicious network packets while preserving their malicious utility and remaining undetected by the NIDS.
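
To make the idea of a minimal modification concrete, the sketch below evades a simple linear classifier. The weights, bias, and sample are illustrative assumptions rather than anything from the text; for a linear decision function, the smallest change that flips the prediction moves the sample just past the decision boundary.

# A minimal sketch of an evasion attack against a linear classifier.
# The model weights, bias, and sample below are illustrative assumptions:
# for a linear decision function f(x) = w.x + b, the smallest L2 perturbation
# that flips the sign of f(x) moves x straight towards the decision boundary.
import numpy as np

w = np.array([1.5, -2.0])   # assumed model weights
b = 0.5                     # assumed model bias
x = np.array([2.0, 1.0])    # a sample currently classified as positive

def predict(sample):
    """Return the predicted class (+1 or -1) of the linear model."""
    return 1 if np.dot(w, sample) + b >= 0 else -1

# Minimal perturbation: move x towards the decision boundary and step
# slightly past it, so the modification stays as small as possible.
margin = np.dot(w, x) + b
delta = -(margin / np.dot(w, w)) * w * 1.01   # 1% overshoot past the boundary

x_adv = x + delta
print(predict(x), predict(x_adv))   # original vs. adversarial prediction
print(np.linalg.norm(delta))        # size of the modification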

Terminology

Network Intrusion Detection System (NIDS)

NIDS is a security technology designed to monitor network traffic and detect potential malicious activities or unauthorized access attempts within a computer network. It operates by analyzing packets/flows in real time and comparing them against a database of known attack signatures, abnormal patterns, and predefined rules. A NIDS can alert network administrators when suspicious or malicious activity is detected.

NIDS are typically deployed at strategic points of a network, such as gateways, routers, and switches. NIDS can be implemented as standalone hardware or in software, depending on the organization’s security requirements and policies.

Snort (https://snort.org) is an example of a well-established NIDS. It offers sophisticated pattern-matching capabilities that are used to uniquely describe attack traffic, and it checks for the latest viruses, worms, and other new vulnerabilities.

From an attacker’s point of view, the more information about the target AI-based system that’s available, the higher the chance the attacker can successfully trick the AI model. Evasion attacks don’t require any access to the training datasets but do require some level of knowledge of the target model. Depending on the threat model, existing adversarial attacks can be classified into three categories: white-box, grey-box, and black-box attacks, with the main difference being the knowledge and capabilities of the attackers.

In white-box attacks, attackers have complete knowledge of the target model, including its architecture and parameters. In black-box attacks, attackers have very limited knowledge: they can only query the target model to obtain complete or partial information about its outputs, which makes the generation of adversarial examples more difficult. In grey-box attacks, adversaries are assumed to know only the structure of the target model, not its parameters. Evasion attacks against well-known model families, such as kernel-based classifiers and deep neural networks, have been proposed in the literature.
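
As an illustration of the white-box case, the following sketch crafts adversarial examples with a fast-gradient-sign-style step, assuming full access to a differentiable PyTorch model and its gradients. The model, the inputs, and the epsilon value are assumptions made for the sake of the example, not anything prescribed by the text.

# A hedged sketch of a white-box evasion attack (FGSM-style), assuming the
# attacker can compute gradients of the target model's loss with respect to
# the input. `model` stands in for any differentiable classifier.
import torch
import torch.nn as nn

def fgsm_example(model, x, true_labels, epsilon=0.03):
    """Craft adversarial examples with a fast gradient sign step."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = nn.CrossEntropyLoss()(model(x_adv), true_labels)
    loss.backward()
    # Step in the direction that increases the loss, bounded by epsilon,
    # so each input is only minimally modified.
    return (x_adv + epsilon * x_adv.grad.sign()).detach()

# Usage (illustrative): given a trained torch.nn.Module `model`, a batch of
# inputs `x`, and their true labels `y`:
# x_adv = fgsm_example(model, x, y)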

Data poisoning attack: label flipping attacks

A label flipping attack targets the training part of the model lifecycle. It aims to manipulate the output of a machine learning model by changing the labels of some of the training examples used to train the model. In a typical supervised learning scenario, a machine learning model is trained using a labeled dataset, where each data point is assigned a correct label. An attacker uses a label flipping attack to intentionally change the labels of some of the training examples, causing the model to misclassify future data points. Consider a binary classification problem in which the model has been trained to classify images of cats and dogs. An attacker could change the labels on some of the training set's dog images to cats and vice versa. This could cause the model to learn incorrect associations between data features and labels, resulting in incorrect predictions on new data. Deep neural networks, for example, are particularly vulnerable to label flipping attacks because they aren’t robust to small changes in input. Defending against such attacks can involve a variety of techniques, such as adversarial training, data augmentation, or including randomness in the model's output. We’ll go through these techniques more closely in the next section.
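
As a minimal sketch of how such poisoning might be carried out on a labeled dataset, the snippet below inverts the labels of a randomly chosen fraction of training examples in a binary classification setting. The flip rate and the use of 0/1 labels are illustrative assumptions.

# A minimal sketch of a label flipping attack on a labeled training set.
# The flip rate and binary 0/1 labels are illustrative assumptions; the point
# is only that a fraction of labels is deliberately inverted before training.
import numpy as np

rng = np.random.default_rng(seed=0)

def flip_labels(y, flip_rate=0.1):
    """Return a copy of binary labels y with `flip_rate` of them inverted."""
    y_poisoned = y.copy()
    n_flip = int(flip_rate * len(y))
    idx = rng.choice(len(y), size=n_flip, replace=False)
    y_poisoned[idx] = 1 - y_poisoned[idx]   # 'dog' labels become 'cat' and vice versa
    return y_poisoned

# Usage (illustrative): train one model on (X, y) and another on
# (X, flip_labels(y)), then compare their accuracy on clean test data.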

The optimal label flipping attack is the one that finds the subset of poisoned training data that causes the most significant possible damage to the performance of the attacked model. This approach assumes a white-box attacker with access to all training data, the model to attack, and all related model parameters. The attacker’s goal is to degrade the attacked model's performance (for example, its accuracy) as much as possible. In the optimal label flipping attack, the attacker samples all possible poisoned permutations for a given poisoning rate, and re-trains and re-evaluates the known model for each poisoned permutation. Finally, the attacker selects the permutation that caused the most significant damage and supplies the poisoned training data set to the victim. However, since this approach is computationally intractable, researchers propose a heuristic approach named the randomized label flipping attack, which also takes into account a restricted attacker budget. In this approach, the attacker doesn’t evaluate all permutations but only a randomly sampled and limited number of poisoned permutations, and subsequently re-trains and re-evaluates the model. The attacker continues with this approach until either a maximum number of iterations is reached (the attacker budget is exhausted) or a satisfactory decrease in model performance is measured. Finally, the attacker again selects the permutation with the most significant damage in terms of model performance and supplies it to the victim.
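
A rough sketch of the randomized label flipping attack described above could look as follows. The helper functions train and evaluate stand in for the victim's training pipeline, which the white-box attacker is assumed to be able to reproduce; they, along with the flip rate, budget, and target accuracy drop, are hypothetical names and parameters used for illustration.

# A hedged sketch of the randomized (heuristic) label flipping attack.
# `train(X, y)` is assumed to return a trained model and `evaluate(model)` its
# accuracy; both are hypothetical helpers, not defined in the text.
import numpy as np

rng = np.random.default_rng(seed=0)

def randomized_label_flip_attack(X, y, train, evaluate,
                                 flip_rate=0.1, budget=50, target_drop=0.2):
    """Search random flip sets and keep the one that hurts accuracy most."""
    baseline = evaluate(train(X, y))        # clean-model performance
    n_flip = int(flip_rate * len(y))
    best_y, best_acc = y, baseline

    for _ in range(budget):                 # limited attacker budget
        idx = rng.choice(len(y), size=n_flip, replace=False)
        y_poisoned = y.copy()
        y_poisoned[idx] = 1 - y_poisoned[idx]     # flip the sampled labels
        acc = evaluate(train(X, y_poisoned))      # re-train and re-evaluate
        if acc < best_acc:
            best_y, best_acc = y_poisoned, acc
        if baseline - best_acc >= target_drop:    # satisfactory damage reached
            break

    return best_y   # the poisoned labels supplied to the victim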

Privacy attacks

A common privacy attack is a ‘data inference attack’, which takes advantage of the information leaked by machine learning models to obtain information about the people whose data was used to train the models. Despite the name sounding like it has to do with the inference phase of the model lifecycle, inference in this case refers to the attacker being able to infer things about the data used to train the model. Two main types of inference attacks exist: ‘membership inference’ and ‘attribute inference’ attacks.

Membership inference attacks assume a situation where a record (for example, an entry about a person) and access to a machine learning model (for example, a black-box API) are given, and the attack tries to identify whether the record is included in the training data used for the development of the model. There are many cases where this attack can have a serious impact, for example, when the attack is attempted with sensitive personal data such as purchase records, location, or medical records. The basic idea of the attack is to learn the difference between the target machine learning system’s behavior on inputs that were already seen in the training data set and its behavior on inputs from unseen data.
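
A very simple, confidence-based variant of this idea is sketched below, assuming only black-box access to a predict_proba function that returns class probabilities. The function name and the threshold are illustrative assumptions; in practice the threshold is often calibrated, for instance with shadow models trained on similar data.

# A minimal sketch of a confidence-based membership inference attack, assuming
# black-box access to a hypothetical `predict_proba(record)` API that returns
# class probabilities. The threshold value is an illustrative assumption.
def is_member(predict_proba, record, true_label, threshold=0.9):
    """Guess whether `record` was in the target model's training set.

    Models tend to be more confident (and more often correct) on records they
    were trained on, so a very high confidence on the true label is taken as
    evidence of membership.
    """
    confidence = predict_proba(record)[true_label]
    return confidence >= threshold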

Attribute inference attacks assume a situation where an attacker already has partial knowledge about a data record and tries to infer the values of the missing attributes. The attack model is similar to the membership inference attack. The attacker can repeatedly query the target model with different possible values of a missing attribute and analyze the outputs to find the value that’s indeed in the corresponding record of the training data set. Overfitted models are more vulnerable to both attacks, since their behavior on records in the training data differs more from their behavior on other, general records.
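
The querying strategy can be sketched as follows, again assuming black-box access to the same kind of hypothetical predict_proba API: the attacker tries each candidate value for the missing attribute and keeps the one the model is most confident about. All names and parameters here are assumptions for illustration.

# A hedged sketch of an attribute inference attack: the attacker knows every
# attribute of a record except one, queries the target model with each
# candidate value, and picks the value that yields the highest confidence.
def infer_missing_attribute(predict_proba, partial_record, missing_index,
                            candidate_values, true_label):
    """Return the candidate value that maximizes the model's confidence."""
    best_value, best_confidence = None, -1.0
    for value in candidate_values:
        record = list(partial_record)
        record[missing_index] = value            # try one candidate value
        confidence = predict_proba(record)[true_label]
        if confidence > best_confidence:
            best_value, best_confidence = value, confidence
    return best_value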

Next section
II. Technical solutions to mitigate attacks