V.

The steps in building AI applications

Before we jump into the details of fairness, explainability, and resiliency, let’s make sure you have a basic understanding of the steps involved in building AI applications. This will help you recognize where in the process each of the methods, security attacks, and analytics fits in.

Terminology

The machine learning pipeline

An AI model is built using a machine learning “pipeline”, which defines the process from data collection and preparation to training and deployment of the models. These pipelines can also contain steps related to validation, metrics monitoring, and so on, adding to the robustness of the resulting system.

There are different tools, practices, and even roles tied to different stages of the pipeline. For example, model development might fall to a data scientist, whereas building the software that serves the model online might be done by an engineer or backend developer.

Let’s look into the different steps of a machine learning model pipeline.

Process chart of the steps in the machine learning pipeline

An example of a machine learning pipeline

AI development starts with an exploration phase – the data scientist looks at different approaches to solving the problem at hand with AI algorithms. After an algorithm is selected for the application, an AI model is trained using the available data. When planning a new product that uses AI, it’s important to keep the pillars of trustworthy AI in mind from the planning stage onward. Ensuring that data privacy, model explainability, and fairness are taken into account is easiest when they are designed into the pipeline from the very beginning.
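The flow through the pipeline can be sketched as a chain of plain functions. This is a simplified illustration under assumed toy data, not a production pipeline – all names and the "model" itself are hypothetical:

```python
# A minimal sketch of a machine learning pipeline as a chain of stages.
# Real pipelines use orchestration tooling, but the data flow is the same.

def collect_data():
    # In practice: pull from databases, vendors, or crowdsourced sensors.
    # Here: (feature, label) pairs for a toy binary classification task.
    return [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1)]

def prepare_data(raw):
    # In practice: clean, impute, label, and check for bias.
    return [row for row in raw if row[0] is not None]

def train_model(dataset):
    # In practice: fit a learning algorithm. Here: a threshold placed
    # midway between the highest class-0 and lowest class-1 value.
    zeros = [x for x, y in dataset if y == 0]
    ones = [x for x, y in dataset if y == 1]
    threshold = (max(zeros) + min(ones)) / 2
    return lambda x: 1 if x >= threshold else 0

def deploy(model):
    # In practice: serve the model online or schedule it for batch jobs.
    return model

model = deploy(train_model(prepare_data(collect_data())))
print(model(3.5))  # classifies a new observation as class 1
```

Each stage only consumes the previous stage’s output, which is what makes it possible to attach validation, monitoring, and fairness checks between stages later on.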

Step 1: Data collection

In this first step, data is acquired from different sources and vendors. It can also be acquired via crowdsourcing/crowdsensing methods that allow voluntary data contribution by people using their personal devices – for instance, data captured from smartphone sensors like GPS, a camera, or an accelerometer. Enough data must be collected at this phase to support both training and validation of the model. This is also the stage where matters of privacy arise for the first time: what data can be used, how and where can it be stored safely, and what risks lie ahead in its usage?
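A toy sketch of this step, assuming records arrive as in-memory lists (real collection would read from databases, vendor APIs, or device sensors), with one privacy consideration applied at collection time. All source names and fields are illustrative:

```python
# Illustrative pooling of records from multiple hypothetical sources.
vendor_data = [{"user": "a", "speed": 4.2}, {"user": "b", "speed": 5.1}]
crowdsourced = [{"user": "c", "speed": 3.8}]  # e.g. from phone sensors

# Pool everything into one dataset; later stages will split it into
# training and validation portions.
dataset = vendor_data + crowdsourced

# Privacy consideration at collection time: strip direct identifiers
# before the records are stored.
anonymized = [{k: v for k, v in rec.items() if k != "user"}
              for rec in dataset]

print(len(anonymized))  # 3
```

Stripping identifiers this way is only one small piece of privacy preservation; later chapters cover it in more depth.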

Step 2: Data preparation

Depending on the AI model used, several data preparation (or data pre-processing) sub-steps may be required to transform the raw data into suitable input for AI algorithms. Data scientists are often faced with datasets that include missing or invalid values, which lead to low accuracy and poor performance during the learning process. Hence, data preparation cleans the data and readies it for learning. This phase also addresses challenges around bias and fairness in the dataset, and some of these issues are solved already at this stage – for example, by modifying the existing data to better represent the desired outcome of the model. One topic that comes up during data preparation is data labeling; we’ll talk more about labeling and its meaning in the following chapters. The presence or absence of labels is what distinguishes supervised learning from unsupervised learning.

  • An example of supervised learning would be that we have a dataset of images of cats and dogs, where each image has the label “cat” or “dog” telling the algorithm what it represents.

  • In unsupervised learning the data isn’t labeled – for example, we might have a dataset of members of a loyalty program of a shopping mall and can use the unlabeled data to find groups of similar shoppers among them by clustering approaches.

If labeled data is needed, labeling is also done as part of data preparation. It can be done automatically or manually – in fact, manual labeling “sweatshops” in low-income countries have become one of the hot topics in recent AI ethics discussions.
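As a rough sketch of the cleaning side of data preparation, here are two common sub-steps – dropping invalid rows and imputing missing values – applied to a small made-up dataset (all values and field names are illustrative; real pipelines use dedicated tooling for this):

```python
from statistics import mean

# A tiny labeled dataset with one missing and one invalid measurement.
raw = [
    {"height": 170, "label": "cat"},
    {"height": None, "label": "dog"},   # missing value -> impute
    {"height": -5, "label": "cat"},     # invalid value -> drop
    {"height": 182, "label": "dog"},
]

# Drop rows with clearly invalid measurements.
valid = [r for r in raw if r["height"] is None or r["height"] > 0]

# Impute missing heights with the mean of the observed ones.
observed = [r["height"] for r in valid if r["height"] is not None]
fill = mean(observed)
cleaned = [dict(r, height=r["height"] if r["height"] is not None else fill)
           for r in valid]

print([r["height"] for r in cleaned])  # [170, 176, 182]
```

Note that choices made here – which rows to drop, which values to fill in and how – can themselves introduce or reduce bias, which is why fairness considerations belong in this stage.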

Step 3: Model training

Once proper data input is in place, the training process is configured. This is the stage where the requirements for the explainability of the application need to be evaluated. Based on this need, the choice between inherently explainable “glass-box” models and not directly explainable “black-box” alternatives is made. Depending on the model used, training may be a computationally heavy task. In such cases, the training process can be distributed across multiple devices to speed it up.

After the model is trained, its performance and generalizability must be evaluated using data never seen during training. One approach is to split all available data and create a dedicated hold-back test set used only for evaluation. If there is a requirement for explainability, this is also the stage where those processes are put in place, and various fairness and explainability metrics of each trained version of the model can be stored. Monitoring the training process can also give us valuable insight into shifts in the model or data that impact our product. Hence, model monitoring, both during training and later in deployment, plays a key role in ensuring the robustness of the system.
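The hold-back evaluation idea can be sketched with a toy one-dimensional classifier. The split here is fixed by hand for clarity – in practice it would be randomized – and the threshold “model” stands in for any trained predictor:

```python
# Evaluate on a hold-back test set: the model is fit on `train` only
# and scored on `test`, data it never saw during training.
train = [(0.0, 0), (0.2, 0), (0.4, 0), (0.9, 0),
         (1.0, 1), (1.3, 1), (1.7, 1), (2.0, 1)]
test = [(0.5, 0), (0.8, 0), (1.1, 1), (1.9, 1)]

# "Training": place the decision threshold midway between the highest
# class-0 value and the lowest class-1 value seen in the training set.
max0 = max(x for x, y in train if y == 0)
min1 = min(x for x, y in train if y == 1)
threshold = (max0 + min1) / 2

def predict(x):
    return 1 if x >= threshold else 0

# Accuracy on held-back data estimates how well the model generalizes.
accuracy = sum(predict(x) == y for x, y in test) / len(test)
print(accuracy)  # 1.0
```

In a real pipeline the same evaluation step is where fairness and explainability metrics for each trained model version would be computed and stored alongside accuracy.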

Step 4: Model deployment, inference, and incremental training

Commonly, AI robustness is characterized by focusing mainly on accuracy and performance (for example, precision, recall, and resource utilization). Once the model is trained and its performance evaluated, it can be deployed within applications. Two common ways of using AI models are online inference, which means the model is used in real time (often over the internet), and batch inference, where the model might be run, for example, once a week on larger batches of accumulated data. In either approach, models can continue learning (re-training) as new data is collected and observed. As part of a robust and secure solution, this is the step where engineering expertise is needed to provide reliable and secure access to the model.
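The contrast between the two inference styles can be sketched as follows. The `model` function stands in for any trained predictor, and both wrappers are hypothetical simplifications (real online inference would sit behind a web service, and batch inference behind a job scheduler):

```python
# A stand-in for a trained model: a fixed decision threshold.
def model(x):
    return 1 if x >= 0.95 else 0

# Online inference: one request arrives, one prediction is returned,
# in real time (typically behind a web API).
def handle_request(x):
    return model(x)

# Batch inference: score a larger batch of accumulated data at once,
# e.g. in a weekly scheduled job.
def run_batch(batch):
    return [model(x) for x in batch]

print(handle_request(1.2))         # 1
print(run_batch([0.1, 0.5, 1.4]))  # [0, 0, 1]
```

The engineering concerns differ accordingly: online inference must be fast and highly available for each request, while batch inference must process large volumes reliably on a schedule.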

Part summary

Here’s what you’ve learned in the first chapter of this course:

  • AI already has a huge societal impact on business, media, entertainment, government, and infrastructure.

  • It’s important to consider the ethical implications of AI on human rights when developing and implementing AI solutions.

  • The methods and concepts of trustworthy AI such as transparency, fairness, explainability, resilience, and privacy preservation help prevent and solve ethical problems of AI solutions.

  • Trust and trustworthiness aren’t only technical components. So, while AI tools can improve efficiency and effectiveness for many businesses, a focus on trustworthy AI requires businesses to increase the trustworthiness of their internal processes.

  • Trustworthiness is considered at different steps of building AI applications. In a machine learning pipeline, the steps include data collection, data preparation, model training, and model deployment.

Now that you’ve learned this, you can better understand and evaluate the purposes and roles of AI tools and consider who might benefit and who might not.

Next Chapter
2. Fairness and accountability