This report is part of the research project ‘From co-generated data to generative AI’ and sets out the work completed between June 2023 and May 2024. The project was commissioned by the Global Partnership on Artificial Intelligence (GPAI) and delivered by the Open Data Institute (ODI) and Aapti Institute, with support from Pinsent Masons and CEIMIA.
Data has become an important part of society. Where it was once a by-product of industrial, commercial, consumer and other activities, it is now a resource in its own right. Data enables the development of new and improved technologies that have become important parts of our lives. At the same time, access to data is critical to tackling some of society's biggest challenges, such as the climate crisis and health inequalities.
The rapid expansion of the data economy raises serious questions for governments, businesses and the public about who has access to data, who gets to decide what data is used for, and ultimately who is able to realise the value of data. It also raises questions about how to limit the misuse of data, how to preserve people's privacy and how to hold those causing harm to account.
Increasing attention is being paid to the idea that parties who contribute to the generation of data should have some rights over how that data is used. There have recently been debates over the rights of users of Internet of Things (IoT) devices, such as sensors within autonomous vehicles or smart speakers. Substantial amounts of data are collected to develop these technologies, but the users of these devices, who are critical to generating that data, may have no rights over it.
This phenomenon is particularly notable in developments around generative AI. It is well documented that without data, there is no AI. Data is the cornerstone of AI models, guiding their development and deployment at every stage. For generative AI, much of this data is scraped from the internet: conversations on social media, videos shared on YouTube, and images posted online.
What is co-generation?
‘Co-generated data’ refers to data generated by more people or entities than the data holder alone. Co-generation is not limited to data, however; it occurs across the data ecosystem: when we scroll through social media, when we speak to our voice assistants, and when we prompt conversational AI tools. In each of these examples there can be multiple data co-generators, with different levels of involvement in the co-generation process and different levels of awareness of its potential financial, social, legal or ethical implications.
As generative AI enters the mainstream, we must update our thinking about co-generated data, technology and AI-generated works. On the one hand, there is a need for a fairer technology ecosystem, in which members of the public, creators and communities have strong legal rights to protect them and their livelihoods. On the other, there is the question of access to data and content for innovation and the development of these new technologies. Access to data is a crucial part of innovation, for AI and beyond. There needs to be a balance between enabling innovation and ensuring the rights of people and communities.
What are the challenges around co-generation in the era of generative AI?
This research analysed six scenarios of co-generation, using a framework of legal rights developed through a literature review and supplemented by in-depth insights from interviews and workshops with experts from around the world. The six scenarios cover a wide range of types of co-generation, co-generators and legal contexts:
- Remunerated work (Karya),
- Collaborative crowd-sourced data collection (OpenStreetMap),
- Social media platforms (Instagram),
- Internet of Things (BMW connected cars owned by Europcar),
- Voice assistants (Whisper),
- Generative AI (Midjourney)
We found that co-generators in these scenarios are subject to a web of different rights, spanning intellectual property rights, data rights and others, such as labour rights. However, gaps remain, particularly around co-generated technology and AI-generated works, and around rights for communities and collectives.
In practice, where rights exist, they are brokered through terms and conditions (T&Cs), contracts and licences. These mechanisms do not always serve their purpose: the failures of T&Cs, for example, are well documented. Similarly, co-generators can find it difficult to apply copyright protections, and the system is not always set up in their interest. For example, where existing IP laws recognise rights over works, standard contractual documents such as terms and conditions can transfer IP rights from creators to bigger entities, leaving co-generators without protection.
Finally, deciding if and how rights should exist over different types of co-generated data is difficult, particularly for generative AI models. Generative AI developers do not always disclose where training data has been collected from, making it difficult to know with certainty whether particular data or content has been included. Most training datasets are built from multiple sources, so they are likely to contain data under varying and incompatible licences. And the sheer scale of data needed to train some AI models makes it difficult for developers to verify ownership and licences across millions of items, or to check for personal data included without a legal basis.
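To make the licence-incompatibility problem concrete, the sketch below shows a toy compatibility check over a training corpus assembled from multiple sources. The source names, licence labels and compatibility table are illustrative assumptions for this example only; they are not legal advice and do not reflect any methodology from this research.

```python
# Illustrative sketch only: a toy licence-compatibility check for a
# training corpus built from multiple sources. The compatibility table
# below is a simplified assumption; real compatibility depends on
# jurisdiction, licence versions and the intended use of the dataset.

from itertools import combinations

# Hypothetical table of licence combinations assumed safe to mix.
COMPATIBLE = {
    frozenset({"CC0-1.0"}),
    frozenset({"CC-BY-4.0"}),
    frozenset({"CC0-1.0", "CC-BY-4.0"}),
    # CC-BY-NC-4.0 is deliberately absent: its non-commercial clause
    # conflicts with permissive licences for commercial training use.
}

def conflicting_sources(sources: dict[str, str]) -> list[tuple[str, str]]:
    """Return pairs of sources whose licences are not known to be compatible."""
    conflicts = []
    for (src_a, lic_a), (src_b, lic_b) in combinations(sources.items(), 2):
        if frozenset({lic_a, lic_b}) not in COMPATIBLE:
            conflicts.append((src_a, src_b))
    return conflicts

# A hypothetical corpus assembled from three sources.
corpus = {
    "forum_scrape": "CC-BY-NC-4.0",  # non-commercial only
    "open_archive": "CC0-1.0",
    "photo_site": "CC-BY-4.0",
}

for a, b in conflicting_sources(corpus):
    print(f"Potential licence conflict: {a} <-> {b}")
```

Even this trivial pairwise check presupposes reliable provenance metadata for every item in the corpus; at the scale of millions of scraped items, that metadata often does not exist, which is precisely the verification difficulty described above.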
The world of co-generation is complex. Further research is needed to fully map the landscape of rights: where there are gaps, how rights overlap, and how they apply in practice, as well as to explore non-legal mechanisms such as new technologies, governance models and licences.
If you’d like to learn more about this work, or our related work on participatory data or data-centric AI, get in touch at [email protected].