Project

Data-centric AI

Without data, there would be no AI. To deliver on AI safety we need to consider the data infrastructure of existing and future applications of AI.

Tue Nov 28, 2023

Artificial Intelligence (AI) may not be a new concept in the technology world, but the public release of ChatGPT a year ago marked a step change. The release gave anyone with access to the internet the ability to "talk" to an AI programme like ChatGPT, Claude, or Midjourney using text prompts rather than specialist language. It sparked an unprecedented wave of research, development, and policy-making that advanced our understanding of the technology and how it could be used ethically and equitably. It also sparked a lot of fear, uncertainty and doubt, including concerns about data privacy, the use of copyrighted content, and authenticity.

Beyond the hype, recent progress in foundation models (FMs) and their accelerated adoption by businesses and government bodies can bring significant opportunities for efficiencies, economic growth and innovation, but there are also significant risks of misinformation, job losses, discrimination, and social inequalities. Balancing these opportunities and risks requires an ecosystem view of AI, which acknowledges the role of data, computing, governance and regulation in moving the field in the right direction over the next few years.

What is data-centric AI and why is it important?

Without data, there would be no AI – that applies to any form of AI, from deep learning, reasoning and planning to knowledge graphs. We need to look closer at the links between data and algorithms, drawing on approaches from multiple disciplines and engaging those directly affected by AI, as well as civic society. The latest wave of large language models (LLMs) and other FMs has disrupted how we think about many components of our data infrastructure: from the value of data we publish openly and the rights we hold on data, both individually and collectively, to the quality and governance of critical datasets. We are using the term ‘data-centric AI’ to advance our thinking in this space – the term was introduced a few years ago in the AI community to advocate for more attention to the data that AI engineers feed into their models. Expanding on the term, we use it to refer to the entire socio-technical data infrastructure of AI – this includes data assets, tools, standards, practices, and communities.

Video: https://vimeo.com/884929644

To deliver on AI safety and follow through on the commitments from the Bletchley Declaration, and other recent announcements and global regulations, we need to consider the data infrastructure of existing and future applications of AI. This goes beyond ongoing efforts to create benchmark datasets that, although useful for evaluating and comparing models, do not represent the vast scenarios in which AI is envisioned to be applied. As generative AI gains traction, there is a risk that, given the costs associated with good data practices, models will be trained and tested on synthetic or lower-quality data, leading – in time – to a degradation in performance and increasing the likelihood of harm. AI data infrastructure and better data practices should be adopted and mandated across industry, informed by the latest advances in data science and engineering, and supported by dedicated data institutions.

What our data-centric AI programme aims to achieve

Building on more than a decade of work creating open, trustworthy data ecosystems, the ODI has helped shift the AI narrative away from an exclusive focus on model development and use towards a wider understanding of the resources – and stakeholders – needed to enable sustainable and responsible technological development. The ODI acts as a key institution researching, connecting, and amplifying diverse ideas and approaches, developing and enabling best practices for data stewardship, and convening a wide range of stakeholders in the ecosystem, including startups, entrepreneurs, researchers, policy-makers and civic society, to help develop an AI data ecosystem grounded in responsible data practices.

Realising the potential of AI to benefit everyone and meeting the commitments of the Bletchley Declaration will require several essential steps in data-centric AI:

Make data AI-ready

We need to enable and support the creation of high-quality AI datasets. Many AI datasets are small, synthetic, or not representative of a particular country, company or context. The result is benchmark saturation: models perform well on the data that is available, but worse when applied to solve real problems.
Existing copyright, data protection, and worker rights must be respected when creating new AI datasets. We need more research to identify gaps in how these rights are currently protected or not in the datasets used by AI systems.
Key AI datasets must be responsibly stewarded and governed. Some datasets are critical for specific sectors and need wise stewardship mechanisms to ensure they are used equitably and maintained to a high standard.
Public-good datasets should be continuously supported as they boost innovation in many areas, including AI. A lot of progress in AI has been on the back of open datasets, but there is a danger that people will stop contributing to and investing in open data, and that most new data fed into AI models will be synthetic or of lower quality.
Best practices in AI data assurance must be established and standardised. While some toolkits are emerging, there is limited guidance or regulation to assure datasets used in public services.

Make AI data accessible and usable

We need to work with data holders to study critical datasets. Most datasets are poorly documented, which means that users find it difficult to understand their intended purpose, known use cases, and limitations.
Fair and equitable data access must be mandated to develop AI use cases with big societal implications, e.g. misinformation, climate, and infectious diseases.
Data standards should be developed to reduce the cost of data operations and allow researchers and smaller businesses to build better AI data infrastructure.
Safe access to datasets for startups and SMEs must be enabled, to boost responsible experimentation and innovation. This is one of the main roadblocks alongside access to computing and AI talent.
The potential for new AI capabilities in making data more accessible, usable, and useful for everyone should be explored. There are opportunities for AI to automate or streamline processes that currently restrict or delay data sharing and use.
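
As an illustration of the documentation point above, here is a minimal sketch of a machine-readable dataset record and a check for missing documentation. The `DatasetCard` class, its fields, and the example dataset are hypothetical and chosen only to mirror the gaps named in this section (intended purpose, known use cases, limitations); they do not represent an ODI standard or any existing metadata format.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class DatasetCard:
    """Hypothetical minimal documentation record for a dataset."""
    name: str
    intended_purpose: str = ""
    known_use_cases: List[str] = field(default_factory=list)
    limitations: List[str] = field(default_factory=list)
    licence: str = ""


def missing_documentation(card: DatasetCard) -> List[str]:
    """Return the names of documentation fields that are still empty."""
    return [f.name for f in fields(card) if not getattr(card, f.name)]


# Illustrative record: limitations and licence have not been filled in.
card = DatasetCard(
    name="example-air-quality",
    intended_purpose="Illustrative only: hourly air-quality readings.",
    known_use_cases=["forecasting"],
)
gaps = missing_documentation(card)
print("Undocumented fields:", gaps)
```

A check like this could run before a dataset is published or shared, so that users are not left to guess at its purpose or limitations.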

Make AI systems use data responsibly

Explore mechanisms that improve understanding of data in the AI lifecycle. This includes exploring whether AI tech holders and downstream applications should be required to share information about data provenance and lineage to foster good data practices in the ecosystem and more thorough analysis of impacts.
Invest in research and innovation to develop more protective and efficient AI models. This can aid the development of models that are less reliant on huge datasets and that do not trade off privacy and data protection for performance.
Design, assess and promote more meaningful data licences that support publishers and users to implement good data practices in AI.
Invest in creating more practical toolkits to inform new regulations and reduce compliance costs. Toolkits, case studies, and peer learning have a role to play in operationalising existing and emerging regulatory frameworks.
Strengthen responsible AI practices through research, training and data literacy. AI engineers should be trained in responsible AI practices, and non-technical workers should have access to tools and training that help them understand the links between data and responsible AI.
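
To make the idea of data provenance and lineage more concrete, the sketch below records which datasets each derived dataset was built from and traces a training set back to its original sources. The dataset names and the `lineage` structure are hypothetical, and real lineage records would also need to capture licences, transformations and dates; this is only one simple way such information might be shared.

```python
from typing import Dict, List, Set

# Hypothetical lineage record: each dataset maps to the datasets it was derived from.
lineage: Dict[str, List[str]] = {
    "training_set_v2": ["cleaned_reviews", "synthetic_augment"],
    "cleaned_reviews": ["raw_reviews"],
    "synthetic_augment": ["cleaned_reviews"],
    "raw_reviews": [],
}


def root_sources(name: str, record: Dict[str, List[str]]) -> Set[str]:
    """Trace a dataset back to the original sources it ultimately depends on."""
    parents = record.get(name, [])
    if not parents:
        # A dataset with no recorded parents is treated as an original source.
        return {name}
    roots: Set[str] = set()
    for parent in parents:
        roots |= root_sources(parent, record)
    return roots


origins = root_sources("training_set_v2", lineage)
print("Original sources:", sorted(origins))
```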

The ODI is committed to advancing research and practice to implement these steps. This includes:

frameworks to define and assess the value of AI data and AI business models;
landscape reviews of emerging data-centric AI technologies such as federated learning and ML data assurance;
innovation programmes supporting startups and artists in using data and AI responsibly to tackle grand challenges such as misinformation, ecology and international cooperation;
learning courses on data ethics and AI and machine learning.

Building on this work, in the next few months, we will:

map out the journey of data in AI model design, training, validation, testing and use to highlight common challenges in responsible data practices for AI
study existing and emerging data stewardship and governance practices in the most popular AI datasets
understand and describe the role of data-related challenges in AI incident reports
design computational approaches to assess the impact of open data sources on AI models' performance
propose participatory approaches to data prompting as a means to help diverse audiences use generative AI tools to find and make sense of data
establish policy priorities for a future AI bill in the UK, and other data legislation, building on the findings of the programme
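
One possible computational approach to assessing the impact of a data source on model performance is a leave-one-source-out comparison: train with all sources, retrain without each source in turn, and compare test accuracy. The sketch below assumes a toy nearest-centroid classifier and invented data; the source names (`open_data`, `partner_data`, `synthetic_data`) and all values are hypothetical, and this is not the method the programme has committed to.

```python
from statistics import mean


def train(examples):
    """Fit a toy nearest-centroid classifier: one mean value per label."""
    by_label = {}
    for x, label in examples:
        by_label.setdefault(label, []).append(x)
    return {label: mean(xs) for label, xs in by_label.items()}


def accuracy(model, examples):
    """Share of examples whose nearest centroid matches their label."""
    correct = sum(
        1 for x, label in examples
        if min(model, key=lambda l: abs(x - model[l])) == label
    )
    return correct / len(examples)


def source_impact(sources, test_set):
    """Return baseline accuracy and, per source, the accuracy lost when it is removed."""
    all_examples = [ex for exs in sources.values() for ex in exs]
    baseline = accuracy(train(all_examples), test_set)
    impact = {}
    for name in sources:
        rest = [ex for other, exs in sources.items() if other != name for ex in exs]
        impact[name] = baseline - accuracy(train(rest), test_set)
    return baseline, impact


# Invented data: (feature value, label). The synthetic source is deliberately biased.
sources = {
    "open_data": [(1.0, "low"), (2.0, "low"), (3.0, "low"),
                  (8.0, "high"), (9.0, "high"), (10.0, "high")],
    "partner_data": [(2.0, "low"), (9.0, "high")],
    "synthetic_data": [(7.0, "low"), (8.0, "low")],
}
test_set = [(1.0, "low"), (3.0, "low"), (5.0, "low"),
            (6.0, "high"), (7.0, "high"), (9.0, "high")]

baseline, impact = source_impact(sources, test_set)
for name, delta in impact.items():
    print(f"{name}: accuracy change when removed = {-delta:+.2f}")
```

A positive impact value means the model does worse without that source; a negative value means the source was hurting performance. Retraining once per source is expensive for large models, so a realistic study would likely need cheaper approximations.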

There is much to do, and as technology and regulation move at lightning speed, we must prioritise this field of enquiry – and work quickly to transform ideas into action. We are excited to collaborate with Microsoft, the Industry Data for Society Partnership, King's College London, the University of Oxford, and many others to make progress and make a difference.

We are keen to hear from funders, partners and other organisations who are interested in helping to develop our programme of work. If you would like to contribute, talk about funding our work or challenge our thinking, we would welcome hearing from you.

Work in this programme

Data-centric AI webinar series

Blog

Learn from world experts on the data in AI

Meet the project team

Arunav Das

Research Fellow

Arunav Das is a Research Fellow at the ODI and a PhD Researcher at King's College London.

Thomas Carey-Wilson

Researcher

Thomas works within the Research team exploring data governance practices throughout the AI data lifecycle.

Joe Massey

Senior Researcher

Joe is a Senior Researcher in the R&D team, and is currently focusing on the sustainable data access project.

Ben Snaith

Senior Researcher

Ben is a Senior Researcher at the ODI.

Elena Simperl

Director of Research, ODI

Elena Simperl is the Director of Research at the ODI and a Professor of Computer Science at King’s College London.

Sophia Worth

Researcher

Sophia is a Researcher at the ODI.

Work in this programme

A framework for AI-ready data

Mapping the role of data work in AI supply chains

If DeepSeek wants to be a real disruptor, it should go much further on data transparency

Building transparent AI systems: our contribution to the EU General-Purpose AI Code of Practice

The UK must lead on data to unlock AI’s full potential

How an AI-ready National Data Library would help UK science

Unlocking data collaboration: A study on data sharing practices and developing standard data licence terms to promote access and social good

How the data for AI ecosystem is evolving

How to build a National Data Library

An AI-ready National Data Library

Building a better future with data and AI: a white paper

Global Policy Observatory for Data-centric AI

The ODI's vision for data-centric AI

The UK government as a data provider for AI

The AI Data Transparency Index

A data for AI taxonomy

Report series: 'Understanding data governance in AI'

AI data transparency: understanding the needs and current state of play

Transforming AI data governance with Croissant: a new standard for ML metadata

Democratising access to data: Bridging the data divide with generative AI models

Policy intervention 5: Empowering people to have more of a say in the sharing and use of data for AI

Policy intervention 4: Ensuring broad access to data for training AI models

Policy intervention 3: Enforcing people’s rights in the data supply chain

Policy intervention 2: Update our intellectual property regime to ensure AI models are trained fairly

Policy intervention 1: Increase transparency around the data used to train AI models

What do we mean by “without data, there is no AI”?
