The age of foundation AI is characterised by models of large scale and high flexibility, capable of producing rich outputs. Recognising both the potential and the risks of these new models, the ODI has embarked on a programme of work on data-centric AI, designed to bring about an AI ecosystem grounded in responsible data practices. We’re exploring what policy interventions could be made to ensure these technologies are developed and deployed in ways that benefit everyone - people, communities and businesses. This is the third in a series of five pieces exploring these interventions.
How are people involved in the data supply chain?
There is significant human involvement behind the data used to train foundation AI models. People carry out tasks that computer models struggle to replicate, such as collecting data, filtering and moderating it, and labelling it.
The global market for this type of data work was valued at $2bn in 2022 and is forecast to grow to $17bn by 2030. Most of this labour takes the form of ‘microtasks’, undertaken in low and middle-income countries. According to The Washington Post, more than two million people in the Philippines alone perform this type of work, including labelling pedestrians for automated driving algorithms, labelling photos of celebrities, and editing chunks of text to ‘ensure language models like ChatGPT don’t churn out gibberish’.
Many AI datasets will also include data about people, including names, pictures and location information. This data might have been scraped from public sources on the web, or collected and used by services we engage with – whether we use them for leisure or for work. Although definitions vary by jurisdiction, data about people is recognised in law as ‘personal data’.
While data protection and labour rights represent different perspectives and interests, we’re addressing them together in this piece. Ultimately, both involve the protection of fundamental rights and freedoms, and we see a similar risk across both areas, whereby the acceleration of foundation AI threatens the enforcement of existing protections. As we’ll discuss, there is also a useful confluence between the two emerging.
There are some topics we’re not going to address here. An expanded view of labour rights and AI supply chains would also include non-data workers, such as those mining for minerals used in computer components, but we’re conscious of our limited expertise in this area. And we address intellectual property in the context of training AI models in a separate post in this series.
Why are labour rights and data protection important in the context of foundation AI?
There are a number of risks to labour conditions and rights in the data supply chain.
First, data workers can be exposed to disturbing images and violent language. A worker with more than seven years in crowdwork described how, despite their exposure to graphic suicide content, they received no content warnings, no counselling and no access to a suicide hotline.
Workers also have a right to a decent standard of living and to social security. However, a Time magazine investigation in 2023 found that AI workers in Kenya had been paid less than $2 per hour and, being classified as independent contractors, lacked social security protections like health insurance, pension contributions and paid leave. There have also been accusations of union busting and mass layoffs following a 2019 strike. This precarity – combined with exposure to extreme content for moderation – has led to a mental health crisis amongst some Kenyan data workers. Another investigation of data annotation companies described a system in which high profit margins were prioritised over workers’ rights and safety. Other investigations have found data labelling work being carried out by minors.
Data workers tend to have limited access to effective remedy and grievance redressal. In some cases, firms remain anonymous, or disappear and reappear frequently, making it extremely difficult to monitor and block bad actors. An Aapti report for the UNDP has described how workers can be penalised and locked out of systems following low ratings.
These risks to labour rights are relevant to any UK organisation that uses AI models that have been trained elsewhere. But given the lack of transparency around the data used to train many of the popular AI models, organisations may not even be aware of how reliant on this labour they are.
From a data protection perspective, foundation AI models risk widening the data protection enforcement gap, whereby regulations around personal data that are stringent on paper are not matched by the practices of organisations in the real world. For example, foundation AI models are trained on vast amounts of data scraped from across the web, and many model developers appear to treat any public data as fair game. In response, twelve national data protection agencies, including the UK’s Information Commissioner's Office, published a joint statement clarifying that the mass scraping of personal information from the web to train AI can constitute a reportable data breach in many jurisdictions.
Many firms are now also changing their terms of service to allow data generated by users to be used to train new AI models. Meta recently announced changes to its privacy policy, claiming that it has a legitimate interest that overrides users' data protection rights in order to develop ‘artificial intelligence technology’. Max Schrems, a data protection activist and lawyer, has criticised these changes for their vagueness, saying that 'this is clearly the opposite of [data protection] compliance’.
And while the focus is largely on the inclusion of personal data in training foundation AI, there may be further data protection challenges downstream. Researchers have shown that it’s possible to cause ChatGPT to ‘leak’ data that its underlying model has been trained on.
Current policy status in the UK and elsewhere
In March 2023, the Italian data protection authority temporarily suspended the use of ChatGPT over concerns about the processing of personal data to train the system. The ban was later lifted, but the authority found further data privacy violations in January 2024. In the US, a Californian lawsuit has claimed that OpenAI’s foundation models were illegally trained on private conversations, medical data and information about children.
The announcement of the 2024 General Election curtailed the passing of the UK’s new Data Protection and Digital Information (DPDI) Bill through the legislative process. The ODI previously shared our disappointment that the proposed Bill would weaken transparency, rights, and protections.
There has been little said by the UK Government about the data supply chain for AI models, or about the country’s own data labelling industry. Although the largest markets are located in low-wage economies, there are some firms based in the UK – such as Prolific and Snorkel AI.
UK organisations in both the public and private sectors are now using pre-trained models, where data work, such as labelling, training, and testing, has already been completed. However, the complex supply chains involved in data labelling and safety testing for these off-the-shelf models can be far from transparent to these organisations.
In May 2024, 97 data labellers, content moderators and other data workers in Kenya wrote to President Biden to argue that 'US Big Tech companies are systematically abusing and exploiting African workers… [by] undermining the local labor laws, the country’s justice system and violating international labor standards'. The letter says that Kenya needs these jobs, but not at any cost.
In April 2024, the European Parliament adopted a new directive to improve the working conditions of platform workers (including data workers). The Directive introduces new rights for workers, including a presumption of employment, limits on algorithmic management decisions (such as hiring and firing), and greater transparency and personal data protections.
Proposals from civil society, industry and other non-government actors
We’re beginning to see some data labelling firms differentiate themselves through commitments to labour rights and ethical standards. Karya, for example, is a non-profit that partners with local NGOs to ensure its jobs go first to those most in need and to historically marginalised communities, and pays workers a share of the proceeds from sales of annotation work on top of their basic wages. As well as fairer compensation, data workers also want their crowd work to deliver more public benefit, and a greater say in how their work connects to its downstream uses.
However, the fact that this labour is so often unseen – as ‘ghost work’ – makes it unlikely that market forces alone will ensure that data workers in the supply chain of foundation AI are treated fairly. Organisations like Turkopticon, Fairwork and the Gig Economy Project continue the hard work of advocating for workers, and have turned their attention to the data supply chain by assessing labour standards and making collective demands for improvements. One of the participants in the ODI’s Humanity United-funded data for workers’ rights peer-learning network, CNV International, has developed a Fair Work Monitor to strengthen the voice of workers through digital data collection. In 2021, workers at Appen, a crowdsourced data firm, began to organise with a technology union, showing that sectoral and company-based bargaining power could also be key to securing workers’ rights in the data supply chain.
There are some areas where labour rights and data protection converge. Data protection rights can be used in a work context, and organisations like Workers Info Exchange and AWO are using data protection laws to empower gig workers with data about their job histories, pay and rankings. Some laws also give workers a right to an explanation of how their data is used for automated decision-making, and to challenge unfair decisions that have caused detriment. While these are primarily being used by platform workers in sectors such as ride hailing and delivery, they could become important tools for workers in the data supply chain to address issues like opaque decision-making, unfair work allocation and dismissals.
Steps to take
To avoid a labour rights and data protection enforcement gap when it comes to the data supply chain of foundation AI, we recommend that the incoming UK Government:
- Ensure that any future data regulation is fit to address foundation AI. As we said in our recent Policy Manifesto, ‘we believe the Data Protection and Digital Information (DPDI) Bill is a missed opportunity to strengthen the data ecosystem.’ Future data protection regulation should ensure that the Information Commissioner remains independent, data processing safeguards are maintained or enhanced, and Subject Access Requests do not become optional. These will all help to ensure that personal data is protected in the data supply chains behind foundation AI.
- Recognise data supply chains and protect the full scope of people’s rights within them. The incoming UK Government should ensure that existing regulations are being followed and poor practices are stopped within the UK’s data labelling market. Labour rights and data protection should also be central to the UK's AI safety agenda. These supply chains are global, and thus the incoming UK Government should cooperate internationally and use its influence to support global improvements in labour rights and data protection, particularly for data labelling and content moderation workers.
- Support the development of ethical standards in the UK's data supply chains. The incoming UK Government should support, strengthen and fund organisations that are setting just standards for working practices within data supply chains, and should help UK organisations ensure that their supply chains meet high ethical standards that stretch beyond compliance with data protection and labour rights laws.
At the ODI, we’re keen to provide insights and resources to policymakers working towards creating a system that protects data and labour rights in response to foundation AI. We will publish related proposed interventions in the coming weeks, focusing on the availability of data and participatory data practices.