
The age of foundation AI is characterised by models of large scale and high flexibility, capable of producing rich outputs. Recognising both the potential and the risks of these new models, the ODI has embarked on a programme of work on data-centric AI, designed to bring about an AI ecosystem grounded in responsible data practices. We’re exploring what policy interventions could be made to ensure these technologies are developed and deployed in ways that benefit everyone - people, communities and businesses. This is the second in a series of five pieces exploring these interventions.

What is intellectual property and how does it relate to training AI models?

Intellectual property refers to inventions, designs, artistic works and other ‘creations of the mind’.

Intellectual property laws are designed to enable people to earn recognition or financial benefit from the things they create. While laws are different across the world, most regimes seek to balance the interests of creators with the interests of the wider public.

Lawmakers recognise scenarios where people should be able to use the intellectual property of others quite permissively. Text and data mining is one of those. It involves compiling vast amounts of numbers, text and images, often from across the web, to reveal new insights. Given its importance for tackling web censorship and fraud, as well as other research, many jurisdictions have introduced exclusions to intellectual property laws that allow for text and data mining to be undertaken.

But while the web’s content has been scraped for decades, and text and data mining exclusions are in place, foundation AI has provided a shock to the intellectual property system.

In this article we’re focusing primarily on text, images, audio, video and artworks that are widely distributed on the web. We’ll talk about other types of data and about new datasets that are built specifically to train AI models in a later article in this series. Also, our focus here is on intellectual property during the training of AI models, rather than on how intellectual property would potentially apply to AI-generated content.

Why is intellectual property important in the context of foundation AI?

AI firms make different arguments as to why data scraping for AI training should be permitted. These include the reasoning that the scale of modern training datasets makes licensing negotiation impossible, or that the rationale behind text and data mining exclusions remains unchanged.

Many disagree. A number of large rights holders have taken AI firms to court over the way they have trained their models, with some seeking significant financial damages or even the destruction of the models themselves. Getty Images, for example, is suing Stability AI for allegedly training its AI model on more than 12 million of its photos without permission or compensation. In July 2023, the author Sarah Silverman sued OpenAI over its use of the Books3 dataset that includes the written works of thousands of authors. Around the same time, a letter was signed by more than 8,000 authors that argued 'millions of copyrighted books, articles, essays, and poetry provide the ‘food’ for AI systems, endless meals for which there has been no bill'. A survey run by the Authors Guild found that 90% of writers believe they should be compensated if their work is used to train AI models.

The training of models on the web’s content has caused rifts even among communities who intended for their works to be widely consumed. During 2023, many of Reddit’s biggest forums ‘went dark’ in protest over the platform’s plans to enable AI developers to access the mass of forum conversations they’d played a vital role in creating. Contributors to Stack Overflow, an internet forum for developers, have been banned from the site after they deleted their content in order to stop it being used to train ChatGPT.

Reforming the UK’s intellectual property regime is therefore key to making the AI data ecosystem benefit everyone, as well as ensuring we don’t enter a ‘data winter’. As put by Henry Farrell, 'if you want LLMs to have long term value, you need to have an accompanying social system in which humans keep on producing the knowledge, the art and the information that makes them valuable. Intellectual property systems without incentives for the production of valuable human knowledge will render LLMs increasingly worthless over time'.

Current policy status in the UK and elsewhere

Intellectual property lawmakers are responding to foundation AI in different ways.

Some nations are trying to create a permissive regime for model training. Singapore’s Copyright Act, for example, has been described as 'positioning Singapore as an attractive hub for AI developers'. Others are more interested in strengthening - or at least enforcing existing - rights holder protections and controls. In the EU, the Directive on Copyright in the Digital Single Market allows for text and data mining for scientific research purposes only, and rights holders can choose to opt out of their works being used for commercial AI training. The new AI Act says that any firm placing a general purpose AI model on the EU market must comply with this, regardless of where their models were trained.

The UK seems unsure what to do on this issue. Back in 2014, the UK Government introduced an exception that allows text and data mining for ‘non-commercial research’ only. In 2020, the UK Government said it intended to diverge from the EU to allow text and data mining in the UK for any purpose, on the basis that the changes could “help make the UK more competitive as a location for firms doing data mining”. Following Sir Patrick Vallance’s recommendation that the relationship between intellectual property and new forms of AI should be clarified, the UK Government backtracked and instead set out to work with publishers and AI developers to agree a ‘code of practice’. However, in February 2024, the UK Government concluded that 'the working group will not be able to agree on an effective voluntary code'.

More recently, the House of Commons Science, Innovation and Technology Committee concluded that an incoming government should bring these discussions to a close, suggesting a financial settlement for ‘past infringements by AI developers’ as well as a new licensing framework and government authority to oversee it.

Proposals from civil society, industry and other non-government actors

To some extent, the market is beginning to respond. Large rights holders - including news outlets, music labels and movie studios - have moved to make licensing deals with AI firms. OpenAI alone has signed deals with Associated Press, Shutterstock and Axel Springer. Google’s deal with Reddit for access to its forum data is said to be worth $60m per year. The developers of the model KL3M make a selling point of the fact that it was trained on 'a curated training dataset of legal, financial, and regulatory documents', for clients that 'didn’t want to get dragged into lawsuits about intellectual property as OpenAI, Stability AI, and others have been'. Fairly Trained is a new non-profit created to certify that AI companies have trained their models on licensed content.

But who will ultimately benefit from an AI ecosystem that relies on costly licensing? Clement Delangue, the CEO of Hugging Face, has suggested that 'if we end up in a system where you can only train good AI models on $$ licensed data, there's a massive risk of concentration of power. It might not be the users, artists, or content creators who will benefit from this but big companies and Hollywood studios who will trade their rights and not redistribute'. According to the Open Source Initiative, an AI ecosystem overly reliant on licensing may end up less diverse and competitive, as small firms and academics do not have the financial means to go to court or enter into bilateral agreements to license data.

There are also new efforts to create mechanisms for smaller, individual rights holders to control how their works are used. Sometimes described as ‘consent layers for AI’ or ‘preference signalling’, these consist of new web publishing protocols (eg W3C’s Text and Data Mining Reservation Protocol), technical tools (eg Nightshade) and data licences (eg Open Data Commons Licences).

While these may work for certain publishers, we can’t rely on them to solve the intellectual property conundrum. As Arvind Narayanan has argued, 'opt-outs are an ineffective governance mechanism. The structural problems with generative AI companies' business models — and the legal landscape that makes them possible — can't be solved by burdening individuals to withdraw their images one by one’. Creative Commons has expressed concern that 'if preference signals are broadly deployed just to limit [the use of data], it could be a net loss for the commons... these signals may be used in a way that is overly limiting to expression'.

Steps to take

We’re conscious that an incoming UK Government will be in a difficult position. It will want the UK to continue to be seen as a place for AI development, which would require a fairly permissive copyright regime, but it will also have the interests of our significant creative industries to protect.

To modernise the UK’s intellectual property regime, we recommend that the incoming UK Government:

At the ODI, we’re keen to provide insights and resources to policymakers working towards creating a fair intellectual property regime in response to foundation AI.

We will also publish further, related arguments for policy intervention in the coming weeks, focusing on data protection, the availability of more structured AI-scale training datasets and participatory data practices.