What is Adversarial Machine Learning?

In the cybersecurity sector, adversarial machine learning refers to attempts to deceive models with specially crafted inputs that confuse the model and cause it to malfunction.

Image: Clint Patterson via Unsplash


With the continuous rise of machine learning (ML), our society is becoming heavily reliant on its real-world applications. However, the more dependent we become on machine learning models, the more vulnerabilities there are for attackers to exploit in order to defeat them.

The dictionary definition of an "adversary" is: 


"one that contends with, opposes, or resists"


In the cybersecurity sector, adversarial machine learning refers to attempts to deceive models with specially crafted inputs that confuse the model and cause it to malfunction.

Adversaries may input data intended to compromise or alter the model's output and exploit its vulnerabilities. These inputs are often impossible to spot with the human eye, yet they cause the model to fail.

In artificial intelligence systems, different kinds of input can be vulnerable, such as text, audio files, and images. Digital attacks are comparatively easy to perform; for example, manipulating only one pixel of an input image can lead to misclassification.

To train machine learning models efficiently and produce accurate outputs, you need large sets of labeled data. Developers who do not collect data from a reliable source of their own sometimes use datasets published on Kaggle or GitHub, which come with potential vulnerabilities that can lead to data poisoning attacks. For example, somebody may have tampered with the training data, affecting the model’s ability to produce correct and accurate outputs.

There are two different types of adversarial attacks: whitebox and blackbox.


Whitebox vs Blackbox Attacks

A whitebox attack is one in which the attacker has full access to the target model, including its architecture and parameters, which allows them to craft adversarial samples against it. Whitebox attackers typically have this access only if they are testing the model as a developer. Developers have detailed knowledge of the network architecture; they know the ins and outs of the model and can build an attack strategy based on its loss function.
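A loss-based whitebox attack can be sketched roughly as follows. This is a minimal illustration, assuming a tiny logistic regression whose weights (`WEIGHTS`, `BIAS`) are made up for the example. The perturbation rule is in the style of the fast gradient sign method, which is a technique swapped in here for illustration and not one named in this article.

```python
import math

# Hypothetical target model: a logistic regression whose parameters the
# attacker can read directly (the whitebox assumption). Values are made up.
WEIGHTS = [2.0, -1.0]
BIAS = 0.5

def predict_proba(x):
    """Probability that input x belongs to class 1."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))

def loss_gradient(x, y):
    """Gradient of the cross-entropy loss with respect to the input x."""
    p = predict_proba(x)
    return [(p - y) * w for w in WEIGHTS]

def sign(v):
    return (v > 0) - (v < 0)

def gradient_sign_attack(x, y, epsilon):
    """Move every feature by epsilon in the direction that increases the loss."""
    grad = loss_gradient(x, y)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]

x = [1.0, 1.0]   # classified as class 1 by the model above
y = 1            # true label
x_adv = gradient_sign_attack(x, y, epsilon=0.6)

print(predict_proba(x) > 0.5)       # True: original input is class 1
print(predict_proba(x_adv) > 0.5)   # False: perturbed input flips to class 0
```

The attacker never guesses here: knowing the weights lets them compute exactly which direction hurts the model most, which is what separates this from the query-only setting.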

A blackbox attack is one in which the attacker has no access to the internals of the target model and can only observe its outputs. They use this query access to generate adversarial samples.
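As a rough sketch of the query-only setting, the example below hides a made-up linear classifier behind a `query` function that returns nothing but a label. The search strategy (`black_box_attack`, growing a perturbation along one feature at a time) is a deliberately naive assumption chosen for readability, not a published attack.

```python
def make_victim():
    """Return a label-only query function; the weights stay hidden inside."""
    weights, bias = [2.0, -1.0], 0.5   # unknown to the attacker (made up)

    def query(x):
        score = sum(w * xi for w, xi in zip(weights, x)) + bias
        return 1 if score > 0 else 0

    return query

def black_box_attack(query, x, step=0.1, max_steps=50):
    """Grow a perturbation along each feature until the returned label changes."""
    original = query(x)
    queries = 1
    for k in range(1, max_steps + 1):
        for i in range(len(x)):
            for direction in (-1.0, 1.0):
                candidate = list(x)
                candidate[i] += direction * k * step
                queries += 1
                if query(candidate) != original:
                    return candidate, queries
    return None, queries   # no adversarial sample found within the budget

query = make_victim()
x = [1.0, 1.0]
x_adv, n_queries = black_box_attack(query, x)

print(query(x), query(x_adv))   # the two labels differ
print(n_queries)                # number of queries the search needed
```

Every candidate costs one query, so rate limiting and monitoring of query patterns are natural places to make this kind of attack more expensive.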


Adversarial Attacks on AI/ML

There are different types of adversarial attacks that can occur. 



Poisoning attacks

Attacks on machine learning models during the training phase are referred to as ‘poisoning’ or ‘contaminating’. This requires the adversary to have access to, or control over, the training data, which, from what we understand, makes them a whitebox attacker.

An adversary feeds a classifier incorrectly labeled data, for example samples that they label as harmless when they are not, and this has a negative impact on the model. It causes misclassification, producing incorrect outputs and decisions in the future.
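The effect of mislabeled training data can be illustrated with a toy example. This is a sketch under stated assumptions: the classifier is a one-dimensional nearest-centroid model, and the data values and the helper names `train_centroids` and `classify` are invented for the illustration.

```python
def train_centroids(samples):
    """samples is a list of (value, label); return the mean value per label."""
    sums, counts = {}, {}
    for value, label in samples:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def classify(centroids, value):
    """Assign the label whose centroid is closest to value."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))

clean = [(1.0, "harmless"), (2.0, "harmless"), (3.0, "harmless"),
         (7.0, "harmful"), (8.0, "harmful"), (9.0, "harmful")]

# The adversary injects harmful-looking samples that are labeled as harmless.
poison = [(6.0, "harmless"), (6.5, "harmless"), (7.0, "harmless")]

clean_model = train_centroids(clean)
poisoned_model = train_centroids(clean + poison)

sample = 6.0
print(classify(clean_model, sample))      # harmful
print(classify(poisoned_model, sample))   # harmless
```

Three mislabeled points are enough to drag the "harmless" centroid toward the harmful region, so an input that was previously caught now slips through.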

One way an adversary can do this is by using their understanding of the model's outputs to slowly introduce data that decreases the model's accuracy, which is known as model skewing.

For example, search engines and social media platforms have built-in recommendation systems that use machine learning models. Attackers manipulate these systems by using fake accounts to share or promote certain products or content, skewing what gets recommended.


Evasion attacks

Evasion attacks typically occur once a machine learning model has already been trained and is receiving new input data. When the adversary has access to the model, this is a whitebox attack, and they can study the model directly to work out how to manipulate it.

When the adversary lacks knowledge of the model and of what will cause it to break, they rely on a trial-and-error process instead, probing the model and observing how it responds.

For example, an adversary may probe the decision boundaries of a machine learning model that filters out spam emails. Their approach may be to experiment with emails that the model has already been trained to screen and recognise as spam.

If the model has been trained to filter out emails containing phrases such as ‘make fast money’, an adversary may create new emails that use closely related or very similar wording, which will pass through the filter. This causes an email that would normally be classified as spam to be classified as not spam, undermining the model.
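The trial-and-error process described above can be sketched as follows. The filter here is a plain phrase blocklist standing in for a trained spam model, which is a simplifying assumption; `SPAM_PHRASES`, `is_spam`, and `first_evasive_variant` are hypothetical names used only for this illustration.

```python
# Stand-in for a trained spam model: a phrase blocklist (an assumption made
# to keep the example small; a real filter would be a learned classifier).
SPAM_PHRASES = ("make fast money",)

def is_spam(email_text):
    """Flag an email if it contains any blocked phrase, ignoring case."""
    text = email_text.lower()
    return any(phrase in text for phrase in SPAM_PHRASES)

def first_evasive_variant(variants):
    """Trial and error: return the first variant the filter lets through."""
    for variant in variants:
        if not is_spam(variant):
            return variant
    return None

variants = [
    "Make fast money from home today!",   # original wording, blocked
    "MAKE FAST MONEY from home today!",   # a case change alone is still blocked
    "Make quick cash from home today!",   # similar meaning, different words
]

passed = first_evasive_variant(variants)
print(passed)   # Make quick cash from home today!
```

The message that gets through carries the same intent as the blocked ones, which is why filters that key on surface wording are easy to evade.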

However, there are more malicious uses, such as adversaries using Natural Language Processing (NLP) techniques to obtain and extract personal information such as identification numbers, leading to more targeted attacks on individuals.


Model extraction

Model extraction is a form of blackbox attack. The adversary does not have access to the model's internals, so their approach is to try to rebuild the model or extract its output data.

This type of attack is most relevant for models that are highly confidential and can be monetised; an example is extracting a stock market prediction model.

An example of a target for this kind of blackbox attack is graph neural networks (GNNs), which are widely used to analyse graph-structured data in application domains such as social networks and anomaly detection. GNN models are valuable intellectual property, which makes them attractive targets for adversaries.

The owner of the data trains an original model, and the adversary collects that model's predictions in order to train another model that imitates it. Access to such models is often offered on a pay-per-query basis, and the adversary uses the query outputs to duplicate the functionality of the model. Through continuous querying and tuning, this essentially allows the adversary to recreate the model.
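The idea of rebuilding a model from its outputs can be shown on a deliberately simple case. This sketch assumes the victim is a linear model that returns its raw score for every query, which is far easier to copy than a GNN or any real production model; the parameters and the names `make_victim_api`, `extract_linear_model`, and `surrogate` are invented for the example.

```python
def make_victim_api():
    """Return a query function; the parameters stay hidden from the caller."""
    weights, bias = [0.5, -1.25, 2.0], 0.75   # secret parameters (made up)

    def query(x):
        return sum(w * xi for w, xi in zip(weights, x)) + bias

    return query

def extract_linear_model(query, n_features):
    """Recover the weights and bias of a linear model with n_features + 1 queries."""
    bias = query([0.0] * n_features)           # the all-zero input reveals the bias
    weights = []
    for i in range(n_features):
        unit = [0.0] * n_features
        unit[i] = 1.0                          # one feature switched on at a time
        weights.append(query(unit) - bias)
    return weights, bias

def surrogate(weights, bias, x):
    """The adversary's copy of the model, built only from query results."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

query = make_victim_api()
stolen_w, stolen_b = extract_linear_model(query, n_features=3)

probe = [1.0, 2.0, 3.0]
print(query(probe), surrogate(stolen_w, stolen_b, probe))   # the two scores match
```

Four queries are enough here because the model is linear and returns exact scores. Returning only labels, rounding scores, or limiting queries per client all raise the cost of extraction.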


How to Avoid Adversarial Attacks

Below are two simple approaches that companies can implement to reduce the risk of adversarial attacks.


Attack and learn before getting attacked

Adversarial training is one approach to improving the robustness and defenses of a machine learning model: we generate attacks on it ourselves. We simply generate a lot of adversarial examples and let the system learn what potential adversarial attacks may look like, helping it to build its own immune system against them. This way the model can either flag such inputs or avoid being fooled by them.
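A minimal sketch of this training loop is shown below, assuming a small logistic regression trained with batch gradient descent and adversarial examples crafted with a gradient-sign perturbation. The dataset `DATA`, the hyperparameters, and the function names are all made up for illustration, and the toy data is cleanly separable, so the example shows the mechanics of the loop and not a realistic robustness gain.

```python
import math

def predict_proba(w, b, x):
    """Probability of class 1 under a logistic regression (w, b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def craft_adversarial(w, b, x, y, epsilon):
    """Shift each feature by epsilon in the direction that raises the loss."""
    p = predict_proba(w, b, x)
    grad = [(p - y) * wi for wi in w]
    return [xi + epsilon * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]

def adversarial_train(data, epsilon, epochs=200, lr=0.5):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        # Attack the current model, then train on clean + adversarial samples.
        # Adversarial samples keep the true label of the sample they came from.
        batch = data + [(craft_adversarial(w, b, x, y, epsilon), y)
                        for x, y in data]
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in batch:
            err = predict_proba(w, b, x) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / len(batch) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(batch)
    return w, b

# Toy two-class dataset (invented for the example).
DATA = [([2.0, 2.0], 1), ([3.0, 1.0], 1), ([1.0, 3.0], 1), ([2.5, 2.5], 1),
        ([-2.0, -2.0], 0), ([-3.0, -1.0], 0), ([-1.0, -3.0], 0), ([-2.5, -2.5], 0)]

w, b = adversarial_train(DATA, epsilon=0.5)

def label(x):
    return 1 if predict_proba(w, b, x) > 0.5 else 0

clean_ok = all(label(x) == y for x, y in DATA)
robust_ok = all(label(craft_adversarial(w, b, x, y, 0.5)) == y for x, y in DATA)
print(clean_ok, robust_ok)
```

The key design choice is regenerating the adversarial examples every epoch against the current weights, so the model keeps training on the attacks that are most effective against it at that moment.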


Changing your model frequently

Continuously altering the algorithms used in the machine learning model creates regular obstacles for the adversary, making it more difficult for them to learn and attack the model. One way to do this is to try to break your own model through trial and error, to identify its weaknesses and understand the changes required to improve it and reduce its exposure to adversarial attacks.



A lot of companies are investing in AI-enabled technology to modernize their ability to solve problems and make decisions. The Department of Defense (DoD) in particular benefits from AI, where data-driven insight can increase situational awareness and speed up decision-making. The DoD has committed to increasing its use of AI, so it must test the capabilities and limitations of these systems to ensure that AI-enabled technology adheres to the correct performance and safety standards. However, this is a big challenge, as AI systems can be unpredictable and can adapt their behavior.

We have to be more vigilant about, and have a better understanding of, the risks associated with machine learning and the high likelihood of data exploitation. Organizations that are investing in and adopting machine learning models and artificial intelligence must put the correct protocols in place to reduce the risk of data corruption, theft, and adversarial samples.

Nisha Arya is a Data Scientist and freelance Technical writer. She is particularly interested in providing Data Science career advice or tutorials and theory based knowledge around Data Science. She also wishes to explore the different ways Artificial Intelligence is/can benefit the longevity of human life. A keen learner, seeking to broaden her tech knowledge and writing skills, whilst helping guide others.