In November 2023, the Open Data Institute launched its data-centric AI programme in response to an unprecedented wave of research, development, and policy-making about artificial intelligence in the wake of the launch of ChatGPT in 2022. As outlined in our white paper on data-centric AI, we stand at an important moment in AI innovation and deployment – one that will be fundamentally shaped by how we govern, steward, and share high-quality data.
The data-centric AI programme, supported by the Patrick J. McGovern Foundation, DSIT, the Omidyar Network and others, addresses a critical opportunity in current AI discourse: while conversations about AI capabilities, safety and risks dominate the headlines, data is the key to unlocking AI's full potential.
In October 2024, forty-five leaders from across the ecosystem – civil servants, academics, industry pioneers, and civil society experts – joined us for an afternoon of intensive discussion about the foundations and future of data and AI. The event unfolded across three distinct but interconnected sessions based on our white paper. Each session offered unique insights into the challenges and opportunities of data-centric AI, with a particular focus on the need for strong data governance, robust data infrastructure and meaningful transparency to realise the value of data for AI. These discussions directly inform and complement our ongoing work with policymakers worldwide as we help shape the legislative agenda for responsible AI development.
Realising the opportunities of AI with well-governed data
The afternoon began with a powerful reminder of what's at stake in the AI revolution. Emma Thwaites, our Director of Global Policy and Corporate Affairs, led a discussion with experts who painted a vivid picture of how data governance – how decisions are made about how data is collected, used and shared – shapes AI's real-world impact. Joe Cuddeford of Smart Data Research UK, Kasia Odrozek of Mozilla's Data Futures Lab, and Dr Wen Hwa Lee of Action Against AMD discussed the importance of investment in strong data governance for AI. The panel was brought to life with real-life examples, such as Common Voice, a community-generated voice dataset covering underrepresented languages for training speech and language models, and the INSIGHT Hub, a repository of eye scan data used to train AI models, which makes decisions about data access via a participatory data access panel.
The need for infrastructure to achieve AI benefits and mitigate harms: the Croissant metadata standard
Next, Omar Benjelloun of Google presented Croissant, a groundbreaking metadata format for AI datasets that promises to make data documentation machine-readable and standardised. His presentation sparked enthusiastic discussion about how technical standards and governance principles must work together to enable responsible AI development. Omar framed the challenge in compelling terms: if we treat data with the same rigour as code, it becomes clear that data lacks the robust tool ecosystem that software developers take for granted. This gap becomes particularly acute as AI systems grow more complex, often incorporating structured and unstructured data across multiple modalities. Some in the room raised the challenges and costs of implementing Croissant at scale. However, the potential to make government datasets more ‘AI-ready’ while maintaining transparency and accountability emerged as a crucial opportunity.
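To make the idea of machine-readable data documentation concrete, the sketch below assembles a minimal Croissant-style record in Python. Croissant builds on the schema.org Dataset vocabulary in JSON-LD; the dataset name, file name and licence shown here are hypothetical, and the properties are a simplified subset of the full format rather than a complete, valid Croissant description:

```python
import json

# A minimal, illustrative Croissant-style metadata record.
# Croissant is JSON-LD built on the schema.org Dataset vocabulary;
# the dataset and file names below are hypothetical, and only a
# simplified subset of the format's properties is shown.
metadata = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "Dataset",
    "name": "example-eye-scan-dataset",      # hypothetical dataset name
    "description": "Illustrative dataset description.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": [
        {
            "@type": "FileObject",
            "name": "scans.csv",             # hypothetical file
            "encodingFormat": "text/csv",
        }
    ],
}

# Because the record is plain JSON-LD, any tool can parse, validate
# and index it without custom, per-dataset documentation formats.
print(json.dumps(metadata, indent=2))
```

Because every dataset described this way exposes the same fields, tooling can discover licences, file formats and provenance automatically – the kind of ecosystem support software developers already take for granted.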
Prioritising trust and empowerment: AI data transparency
The ODI's work on data transparency has long emphasised the need for practical, implementable solutions that bridge technical capabilities with governance needs. Building on this foundation – including recent ODI research – and the rich discussions of our previous sessions, this workshop session brought together diverse stakeholders to shape the development of our AI Data Transparency Index (AIDTI), launching in December 2024 – a tool designed to measure and promote better transparency practices in AI systems.

The workshop revealed how traditional approaches to transparency often fall short of user needs. Standard numerical metrics about training data – such as sample sizes or basic statistical measures – are an important foundation, but they do not achieve the purposes of transparency by themselves. Instead, participants emphasised the importance of understanding decision-making processes: why certain data was included or excluded, how it was processed, and what alternatives were considered.

The workshop also highlighted a crucial challenge for our transparency index: the need to address multiple audiences whilst maintaining practical utility. Much as environmental impact assessments have evolved to serve both regulatory and public information needs, AI transparency must balance technical rigour with accessibility. By addressing documentation and explainability together while remaining focused on practical implementation, we aim to create a framework that helps organisations move beyond mere compliance towards meaningful transparency that serves the public good.
Moving forward: a data-centric approach
As AI development accelerates, focusing on data infrastructure, governance, and stakeholder engagement is critical. The ODI will continue to convene these crucial conversations and work with our partners across sectors to develop practical solutions. Our upcoming AI Data Transparency Index represents one concrete step toward better governance, though much more work remains to be done.
We stand at an important moment in UK tech policy. As the government embarks on its new legislative agenda, we continue to champion the need for data and AI to work for everyone. Through the data-centric AI research programme, this event and our wider work, we see several areas that require focus from Government:
- We need to strengthen the foundations of AI-ready data infrastructure. This requires a robust legislative and regulatory regime.
- The planned National Data Library presents an opportunity to set new standards for what good data infrastructure looks like and enable the government to be a provider of AI-ready data.
- The newly laid Data (Use and Access) Bill focuses on enabling stronger data governance to ‘harness the power of data for economic growth, support modern digital government, and improve people's lives’. The bill offers some positive steps in this regard but more specific work will be needed on data for AI.
- Trust and transparency are essential for the safe and effective rollout of AI in the public interest. Increased transparency about the data used to train AI models – especially high-risk ones – will be essential, and we must work towards developing world-leading standards for AI assurance and auditing.
- Finally, looking ahead to the anticipated AI Bill, data should be as prominent in the legislation as the models themselves, including clear transparency reporting for different types of AI data. We have started thinking about how this might be done in our Taxonomy of Data for AI.
By focusing on strong data infrastructure, effective data governance, and meaningful transparency, alongside open data principles, trust, equitable access, and skills development, we can build a thriving, ethical, and innovative data ecosystem that serves the UK's economic and social goals.
We invite policymakers, industry leaders, and civil society organisations to engage with us in this crucial work. Together, we can shape a data-enabled future that is not only innovative but trustworthy, inclusive, and beneficial for all. Please get in touch at [email protected] if you’d like to work together or learn more about our work on data-centric AI.