From controlling access to information to deciding how resources are allocated, algorithms are ubiquitous in our lives today. While technology can help humans, it can also perpetuate and exacerbate systemic oppression and bias.
At Multitudes, we collect behavioural data that can be sensitive, so it’s important that we have robust data ethics principles (discussed in more detail here). In this blog post, we discuss the steps involved in developing a machine learning product, how biases can be introduced during this process (algorithmic bias), ways to mitigate or eliminate these biases, and how we’re putting these principles into practice at Multitudes.
Algorithmic bias refers to the tendency of algorithms to systematically and repeatedly produce outcomes that benefit one particular group over another. For example, consider the task of image classification using the following two images:
Both are typical wedding photographs, one from a Western wedding and another from an Indian wedding. However, when neural networks were trained on ImageNet – one of the world's most widely used open-source datasets, containing more than 14 million images – they produced very different predictions for these two images (read the paper here). Predictions for the Western bride included labels such as "bride", "wedding", and "ceremony". In contrast, for the woman wearing a traditional Indian wedding dress, the predicted labels were "costume", "performing arts", and "event". Though some may consider this example trivial, such errors are not uncommon, especially given the lack of diversity in the datasets that data scientists typically use for model building. So it is no surprise that there are already many examples of algorithms harming marginalized groups in society (see here, and here).
The machine learning (ML) lifecycle can be broken up into the following five steps:
Biases can arise at each step of the ML lifecycle, so we need mitigations at every stage.
What the step means
This is the first step in the ML process, where one collects, labels and prepares data for modelling and analysis purposes.
Example of how bias can arise
Issues arise when the data collected doesn't fully reflect the real world. For example, studies of such biases have shown that the most popular image datasets – ImageNet, COCO, and OpenImages – contain images mostly from Europe and North America, despite the majority of the world's population living in Asia. As a result, models trained on these datasets perform worse for people from Asia and Africa.
Note that there are many more ways that bias can leak into the data collection process; this is just one example.
Example of an action to mitigate this
Ensure that the data is collected in a manner that reflects reality. In fact, because historical data is collected from a society in which systems of oppression operate, you may even want to over- or under-sample data from marginalized groups in order to move towards more equitable datasets. One example is a facial recognition app: since most image datasets contain fewer images of BIPOC folks, you might want to oversample images of them (a simple rebalancing sketch is shown below). A fantastic resource for learning more about equitable data collection practices is Timnit Gebru's article "Datasheets for Datasets", which proposes a framework for transparent and accountable data collection.
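As an illustration of the oversampling idea, here's a minimal sketch of rebalancing a dataset, assuming the collected metadata lives in a pandas DataFrame; the "region" column name is a hypothetical stand-in for whatever attribute you are balancing on.

```python
# Minimal sketch: oversample under-represented groups until each group is
# as large as the largest one. The "region" column is hypothetical.
import pandas as pd

def oversample_to_largest_group(df: pd.DataFrame, group_col: str = "region",
                                random_state: int = 42) -> pd.DataFrame:
    target_size = df[group_col].value_counts().max()
    resampled = [
        group.sample(n=target_size, replace=True, random_state=random_state)
        for _, group in df.groupby(group_col)
    ]
    return pd.concat(resampled).reset_index(drop=True)

# Usage: balanced_df = oversample_to_largest_group(image_metadata_df)
```

Note that naive oversampling only duplicates existing rows, so in practice you'd pair it with collecting genuinely new data from under-represented groups rather than relying on resampling alone.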
What the step means
When building models for prediction, we construct features. A feature is simply a characteristic of each data point that might be useful for prediction. For example, if we are predicting the price of a house, useful features might be the number of bedrooms or the postcode the house is located in.
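As a toy illustration, here's what constructing a couple of features might look like in pandas; the data and column names are made up.

```python
# Toy sketch: constructing features for a house-price model.
# The data and column names are made up for illustration.
import pandas as pd

houses = pd.DataFrame({
    "price": [650_000, 820_000, 1_100_000],
    "floor_area_m2": [90, 120, 160],
    "num_bedrooms": [2, 3, 4],
    "postcode": ["1010", "0610", "1024"],
})

# A derived feature: a signal that isn't stored directly in the raw data
houses["area_per_bedroom"] = houses["floor_area_m2"] / houses["num_bedrooms"]

features = houses[["num_bedrooms", "floor_area_m2", "area_per_bedroom", "postcode"]]
target = houses["price"]
```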
Example of how bias can arise
Consider a model that predicts whether police officers should be deployed to a particular suburb based on past incarceration data. A data scientist may claim to have built a model that is "socially neutral" because they have removed all features corresponding to race, age, and gender. However, other features like postcode can still correlate with race (because in the real world, suburbs are segregated by race). In fact, this study demonstrates the potential for predictive policing to propagate and exacerbate racial biases in law enforcement.
Example of an action to mitigate this
The simplest counter-measure is to critically examine the relationships between features. In the example above, even after removing race, age, and gender from the modelling process, one should still look for other features (such as postcode) that correlate with those demographic attributes (one simple check is sketched below). It's also important to initiate and maintain contact with communities and stakeholders from different marginalised groups and take a participatory approach to ML. This paper introduces Community Based System Dynamics (CBSD) as a way to engage different groups in designing fairer ML systems. So when designing and deciding on features for models, engage the communities who would be most impacted by the model and get their feedback. Even then, it is not clear that this is sufficient to eliminate all biases.
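For the proxy-feature check mentioned above, here's a minimal sketch using Cramér's V to measure how strongly a candidate feature is associated with a protected attribute; the column names ("postcode", "race") are hypothetical.

```python
# Minimal sketch: measure how strongly a candidate feature acts as a proxy
# for a protected attribute, using Cramér's V (0 = no association,
# 1 = perfect proxy). Column names are hypothetical.
import pandas as pd
from scipy.stats import chi2_contingency

def proxy_strength(df: pd.DataFrame, feature: str, protected: str) -> float:
    table = pd.crosstab(df[feature], df[protected])
    chi2, _, _, _ = chi2_contingency(table)
    n = table.to_numpy().sum()
    n_rows, n_cols = table.shape
    return (chi2 / (n * (min(n_rows, n_cols) - 1))) ** 0.5

# A value close to 1 means the feature effectively encodes the protected
# attribute, even though the attribute itself was removed from the model.
# Usage: proxy_strength(training_df, feature="postcode", protected="race")
```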
What the step means
This involves assessing the accuracy of a model's predictions for a certain outcome – for example, correctly identifying a person's face in a facial recognition system.
Example of how bias can arise
In a recent study, the authors discuss intersectional model analysis as a tool to assess model accuracy, inspired by the sociological framework of intersectionality.
“Intersectionality means that people can be subject to multiple, overlapping forms of oppression, which interact and intersect with each other.” - Kimberlé Crenshaw
In the “Gender Shades Project”, researchers used this approach to examine companies that were selling facial recognition technologies boasting accuracies of up to 90%. However, when accuracy was broken down by intersectional sub-groups, the error rate for darker-skinned women was as high as 34.7%, whereas for lighter-skinned males it was only 0.8%. In hindsight, this is hardly something a multi-trillion-dollar business should be selling at scale, let alone promoting as “accurate”.
Example of an action to mitigate this
It’s necessary for data scientists to advocate for measures of model performance that are broken down by intersectional subgroups (see the sketch below). This is another reason why having a representative dataset matters – so there’s enough data to evaluate the model’s accuracy for different demographic groups. The model cards approach discussed here is a great resource for evaluating and reporting model performance.
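Here's a minimal sketch of breaking error rates down by intersectional subgroups instead of reporting one aggregate accuracy; the column names ("skin_tone", "gender", "y_true", "y_pred") are hypothetical.

```python
# Minimal sketch: intersectional model evaluation. Instead of a single
# aggregate accuracy, compute the error rate for every combination of
# demographic attributes. Column names are hypothetical.
import pandas as pd

def error_rates_by_subgroup(results: pd.DataFrame,
                            subgroup_cols: list) -> pd.DataFrame:
    results = results.assign(error=results["y_true"] != results["y_pred"])
    return (results.groupby(subgroup_cols)["error"]
                   .agg(error_rate="mean", n="size")
                   .sort_values("error_rate", ascending=False))

# Usage: error_rates_by_subgroup(predictions_df, ["skin_tone", "gender"])
# A large gap between the best and worst subgroup is a red flag, even if
# the overall accuracy looks impressive.
```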
What the step means
Model explainability refers to the ability to understand why a machine learning model produces the results it does. The extent to which a model's results can be explained to stakeholders should be a key consideration when evaluating different models, especially in human-centric applications.
Example of how bias can arise
There are many examples of individuals being unfairly impacted by the output of a model. In 2011, a teacher was fired from a Washington, D.C. school because of an algorithm: despite highly favourable reviews from students and parents, an opaque algorithm rated her performance as being in the bottom 2% of all teachers.
Example of an action to mitigate this
When humans interact with ML systems, it is imperative that they understand what personal data will be used, how it will be used, and why a model is being used in the first place.
Product people, software developers, and designers should have a high-level understanding of the ML system they are building, so they can probe what data is being used and how the model's predictions might impact end users' decisions in the real world.
For data scientists, there are many tools available to help understand model behaviour, such as SHAP (SHapley Additive exPlanations), which quantifies the effect of different features on a model's predictions (see the sketch below). When using techniques such as deep learning – where models identify and abstract features in the data that humans wouldn't be able to – data scientists can turn to tools such as LIME, which was designed to work on any black-box algorithm.
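As an illustration, here's a minimal sketch of using SHAP to inspect which features drive a tree-based model's predictions; the dataset and model are just stand-ins, and this assumes the shap and xgboost packages are installed.

```python
# Minimal sketch: global feature-effect inspection with SHAP for a
# tree-based regressor. The dataset and model are stand-ins; swap in your
# own trained model and feature matrix.
import shap
import xgboost
from sklearn.datasets import fetch_california_housing

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
model = xgboost.XGBRegressor(n_estimators=100).fit(X, y)

explainer = shap.Explainer(model, X)   # selects a suitable explainer for the model
shap_values = explainer(X)

# Which features push predictions up or down, and by how much on average
shap.plots.bar(shap_values)
shap.plots.beeswarm(shap_values)
```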
What the step means
Once we’ve trained a model, evaluated that it works effectively, and completed the R&D process, the model is deployed into production.
Example of how bias can arise
At this step, it is important to ensure that the model is being used for its intended purpose. We can introduce bias into a model when there are inconsistencies between the problem the model was built to solve and the way it is used in the real world. This is especially the case when a model is developed and evaluated in a totally self-contained environment, when in reality it exists as part of a complex social system with many decision-makers. For example, Microsoft's chatbot Tay learnt racial slurs within 24 hours of being exposed to Twitter. Another issue is that production data drifts over time – a phenomenon known as data or concept drift – which degrades model performance.
Example of an action to mitigate this
It is necessary to consistently track the quality of the input data. Without robust monitoring in place, the distribution of the input data can drift towards being more biased, even if the model creators ensured diversity in the initial dataset. This makes the model less performant for certain demographics, which means previous work to manage ethical considerations can be undone. We can track this by comparing the distribution of new input data from production with the training data used in model development (a small sketch is shown below). It's also important to label, version, and date the models being used in production, so that it's easy to roll back – or even switch off – models that are performing poorly.
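Here's a minimal sketch of one way to compare a production feature's distribution against the training data, using a two-sample Kolmogorov-Smirnov test; the feature name, alerting function, and threshold are illustrative.

```python
# Minimal sketch: flag drift in a numeric feature by comparing its
# production distribution against the training distribution with a
# two-sample Kolmogorov-Smirnov test. Threshold and names are illustrative.
import numpy as np
from scipy.stats import ks_2samp

def has_drifted(train_values: np.ndarray, prod_values: np.ndarray,
                p_threshold: float = 0.01) -> bool:
    statistic, p_value = ks_2samp(train_values, prod_values)
    return p_value < p_threshold

# Hypothetical usage in a scheduled monitoring job:
# if has_drifted(train_df["feature"].values, last_week_df["feature"].values):
#     trigger_alert("Input distribution has drifted - review model performance")
```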
For us at Multitudes, one of our data principles is that if we get data from someone, we should make sure that they get value from it, and that we use the data transparently. In addition, we never show individual performance data – our data is always aggregated to the team or organisation level.
In our development of ML products, we:
Data collection and preparation
Feature engineering & selection
Model evaluation
Model deployment
At the end of the day, humans are the ones who create algorithms, so we also recognize the importance of the broader culture and environment we create at Multitudes. Some things we consider are:
This article has been a broad overview of some of the ethical pitfalls of machine learning systems. The hope is to provide points to consider when working with ML systems, as well as an example of how we're implementing these mitigations so far at Multitudes.
However, the subject of “Equity and Accountability in AI” is a vast and well-studied field, and we’ve hardly scratched the surface. We hope this encourages everyone from AI researchers to end-users and the general public to have sustained dialogue on the importance of ethical considerations when building and interacting with ML systems. Moreover, it’s worth noting that reducing algorithmic bias is not the full answer – the bigger, more important task is to dismantle systemic oppression. As individuals and as a collective, we can take action to create a more equitable world by making choices in what we consume, how we live, how we work, and who we vote for.
Here are some resources you can use to find out more about equity and accountability in data science – and we’ll keep sharing our learnings and approach as we go!
Examples of Unethical AI systems in society
Research Groups and Organisations
Toolkits, Code and Other Fun Stuff
Books