
To reap the benefits of AI, organisations need to prepare and transform their data to make it directly useful and usable in AI tasks. This can take many forms, from creating metadata and knowledge graphs to producing machine-learning-native representations such as embeddings stored in vector databases.
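As a minimal illustration of the latter, the sketch below shows how textual records might be turned into embeddings and queried by similarity, which is the core operation a vector database performs at scale. It uses the sentence-transformers library as one possible choice of encoder, and the dataset descriptions are invented for the example.

```python
# A minimal sketch: turning dataset descriptions into embeddings and running
# a similarity search - the core operation a vector database performs at scale.
# The descriptions below are invented for illustration.
import numpy as np
from sentence_transformers import SentenceTransformer

descriptions = [
    "Monthly prescribing volumes for general practices in England",
    "Building footprints detected from satellite imagery across Africa",
    "Crowdsourced species observations with photos and locations",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose encoder
embeddings = model.encode(descriptions, normalize_embeddings=True)

query = model.encode(["open data about medicines dispensed by GPs"],
                     normalize_embeddings=True)

# On normalised vectors, cosine similarity reduces to a dot product
scores = embeddings @ query.T
best = int(np.argmax(scores))
print(f"Closest dataset: {descriptions[best]} (score={scores[best][0]:.2f})")
```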

At the ODI we’ve been thinking about what it means for data to be AI-ready. Earlier in 2025, we developed a framework to define the distinct aspects of what makes data ready to be used for AI. The original framework explored the specific criteria for data holders to follow, offering actionable recommendations for data, metadata, and infrastructure design. The ODI’s updated framework adds a further dimension, so that it now considers four critical dimensions of AI-readiness:

  • Dataset properties: standards compliance, semantic consistency, identifiable imbalances, de-identification, and appropriate file formats
  • Metadata: machine-readable formats, attached documentation, technical specifications, supply chain information, and legal clarity
  • Surrounding infrastructure: accessible portals, APIs, and version control
  • Governance: governance policy-as-code, documented roles and responsibilities, publicly identifiable points of contact, and clear data access processes.

These dimensions reflect a simple principle: that AI-ready data is data designed with transparency and technical usability for AI practitioners in mind.

We want this to be a living framework, validated and updated as we apply it in various contexts. This year, we used it in two projects. The first, a joint project with the Estonian company Nortal, focused on data and real-life AI projects in UK local government. The analysis involved assessing ten high-impact use cases where local councils are exploring or piloting AI, across social care, health, traffic, fire, homelessness and environmental settings.

In the second project, we started from a series of open data use cases from all over the world, which are part of our brand-new Open Data Use Case Observatory, developed in partnership with Microsoft. The case studies dive deep into open datasets and their impact, whether that impact came from use in AI or from other technological, data-led interventions.

The Observatory is a timely reminder that open data is absolutely essential in AI.

Open data underpins many of AI’s biggest advances: systems like AlphaFold are built on decades of openly shared protein structures, while biodiversity monitoring models depend on citizen-science datasets such as iNaturalist. Chatbots, among other systems, draw on open knowledge from Wikipedia.

Because it is accessible to anyone, open data goes some way to democratising AI. For example, it allows smaller labs and organisations to train and evaluate models without needing significant amounts of proprietary data. It also enables transparency and accountability: when a healthcare model behaves unfairly, independent researchers can interrogate the underlying prescribing data, something that is impossible with closed datasets. Recent efforts by organisations like the Allen Institute for AI and Pleias to develop AI models using solely open data demonstrate the art of the possible. Not only is training AI models solely on open data technically possible, it is also an affordable approach for organisations of all shapes and sizes.

Despite its importance to the AI ecosystem, open data is not always AI-ready. In this blog we summarise findings from our research analysing the AI-readiness of ten datasets from the ODI’s Open Data Use Case Observatory.

Is open data ready for AI?

In the early days of open data, it was common to publish data openly without thinking deeply about who might use it, how they might want to access it, and where they might have to look to find it. Just as it was then, simply making open datasets available online is not enough for them to be used to support AI development. While open data is not always published with a specific use case in mind, the publishers of datasets released before the advent of ChatGPT in 2022 are, understandably, highly unlikely to have considered what it means to be AI-ready. AI-ready datasets require high-quality metadata and data infrastructure, as well as robust data governance, all of which are essential for AI innovation.

Open data has many different uses for AI. Most will be familiar with the more common ones, such as training foundation models on massive open datasets (for example, trillion-token text corpora like CommonCorpus) and benchmarking performance against standardised evaluation sets. However, recent technological developments have enabled new, emergent use cases. Autonomous agents making real-time decisions need programmatic access to weather data, supply chains, and scientific observations, while answer engines powered by generative AI synthesise information across datasets through natural language queries. Increasingly, data discovery is happening through AI intermediaries rather than traditional portals or search engines like Google Dataset Search. Publishing open data in an AI-ready way is therefore a significant new opportunity: as data publishers look for ways to increase the use and impact of open data, ensuring that data released openly is findable and accessible to these technologies is an important next step.

For this project, we analysed ten open datasets from the ODI’s Open Data Use Case Observatory that span different domains, data types and institutional contexts. They include geospatial and environmental data (Google Open Buildings, Copernicus climate products, the UK National LIDAR Programme), scientific and biomedical data (the AlphaFold Protein Structure Database), social and economic activity data (GitHub Innovation Graph, NHSBSA English Prescribing Data), language and speech resources (Mozilla Common Voice, CommonCorpus), biodiversity observations (iNaturalist), and commodity supply chain data (Trase Earth palm oil flows). For each dataset, we reviewed publicly available documentation, portals, APIs and code repositories, and assessed them against the ODI’s AI‑ready data framework. The following details key insights from the research:

There is an API divide among the datasets covered in this project: some are not accessible via API at all, which limits their usability in the AI landscape. The NHSBSA English Prescribing Data API processed 6.3 million calls in 2023 versus 300,000 manual downloads, demonstrating clear demand for data delivered via an API.
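To make the difference concrete, here is a minimal sketch of what programmatic access looks like from the consumer’s side. The endpoint and parameters are hypothetical and purely illustrative; they are not the real NHSBSA interface.

```python
# A minimal sketch of programmatic access, assuming a hypothetical paginated
# JSON API (the endpoint and parameters are illustrative, not a real service).
import requests

BASE_URL = "https://example.org/api/prescribing"  # hypothetical endpoint

def fetch_all(year_month: str, page_size: int = 1000):
    """Yield records for one month, following pagination until exhausted."""
    offset = 0
    while True:
        resp = requests.get(
            BASE_URL,
            params={"month": year_month, "limit": page_size, "offset": offset},
            timeout=30,
        )
        resp.raise_for_status()
        records = resp.json().get("records", [])
        if not records:
            break
        yield from records
        offset += page_size

# An AI pipeline can consume records directly, with no manual download step
for record in fetch_all("2023-01"):
    pass  # e.g. stream into a feature store or training dataset
```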

When developing AI systems, machine-readable metadata that allows datasets to be discovered and indexed is important for understanding the data. Many of the datasets we assessed lacked ML-specific metadata formats that enable automated discovery; in fact, CommonCorpus was the only dataset we found implementing contemporary metadata standards such as Croissant.
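For illustration, the sketch below serialises a simplified, Croissant-style dataset description as JSON-LD. The structure follows the spirit of the MLCommons Croissant format, but it is not a complete or validated document, and the dataset details are invented.

```python
# A simplified, Croissant-style dataset description, serialised as JSON-LD.
# The structure follows the spirit of the MLCommons Croissant format but is
# not a complete or validated document; the dataset details are invented.
import json

metadata = {
    "@context": {
        "@vocab": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",
    },
    "@type": "Dataset",
    "conformsTo": "http://mlcommons.org/croissant/1.0",
    "name": "example-prescribing-data",
    "description": "Monthly prescribing volumes for general practices.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "prescribing-csv",
            "contentUrl": "https://example.org/data/prescribing-2023-01.csv",
            "encodingFormat": "text/csv",
        }
    ],
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "field": [
                {"@type": "cr:Field", "name": "practice_code", "dataType": "Text"},
                {"@type": "cr:Field", "name": "items_dispensed", "dataType": "Integer"},
            ],
        }
    ],
}

print(json.dumps(metadata, indent=2))  # machine-readable, crawlable by AI tools
```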

In many of the datasets in our observatory, uncertainty was undocumented. While some datasets, such as Google Open Buildings, explicitly quantify quality through confidence scores and precision-recall curves, others do not. Without explicit metadata, AI systems struggle to communicate the nuances of uncertainty, and training models on biased data without understanding that bias compounds fairness problems.
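Where uncertainty is documented, it can be acted on directly. The sketch below filters records by a per-record confidence score before they reach a model; the file path and column name are illustrative (Google Open Buildings publishes a per-building confidence value, but the exact schema should be checked against the dataset’s documentation).

```python
# A minimal sketch of using documented uncertainty: filtering records by a
# per-record confidence score before training. The file path and column name
# are illustrative, not the dataset's exact schema.
import pandas as pd

buildings = pd.read_csv("open_buildings_sample.csv")  # hypothetical local extract

# Keep only detections the publisher marks as high confidence, and record how
# much data the threshold removes so the bias it introduces stays visible.
THRESHOLD = 0.75
high_conf = buildings[buildings["confidence"] >= THRESHOLD]
dropped = 1 - len(high_conf) / len(buildings)
print(f"Kept {len(high_conf)} rows; threshold removed {dropped:.1%} of detections")
```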

While many of the datasets explicitly mention the different versions they have published, it was rare to see full versioning infrastructure, which aligns with our previous research on non-open datasets. Without formal version control, reproducing model training becomes impossible, and research built on specific data snapshots cannot be validated.
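Even without full versioning infrastructure on the publisher’s side, consumers can pin the snapshot they used. The sketch below records a content hash and retrieval date alongside the training code so a model can later be traced back to the exact data it saw; the file names and URL are illustrative.

```python
# A minimal sketch of pinning a dataset snapshot for reproducibility: record a
# content hash and retrieval date alongside training code, so a model can later
# be traced back to the exact data it saw. File names and URL are illustrative.
import datetime
import hashlib
import json
import pathlib

def snapshot_manifest(path: str, source_url: str) -> dict:
    """Build a manifest entry describing one downloaded data file."""
    digest = hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()
    return {
        "file": path,
        "source": source_url,
        "sha256": digest,
        "retrieved": datetime.date.today().isoformat(),
    }

manifest = snapshot_manifest(
    "prescribing-2023-01.csv",
    "https://example.org/data/prescribing-2023-01.csv",
)
pathlib.Path("data_manifest.json").write_text(json.dumps(manifest, indent=2))
```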

Most governance policies remain documents rather than machine-readable formats: most datasets in the observatory have clear, human-readable access processes, but governance is rarely machine-executable. AI pipelines need to assess data provenance, license compatibility, and access requirements programmatically, yet governance typically exists as PDF privacy policies rather than machine-readable rules. Two exceptions stood out: iNaturalist codifies curation rules as executable permissions in platform code, while the Hugging Face CommonCorpus pipeline embeds license filtering directly in data processing.
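As a small illustration of governance as code, the sketch below expresses a license allow-list as an executable rule applied inside a data pipeline rather than in a policy PDF. The license identifiers and records are illustrative, not drawn from any of the datasets above.

```python
# A minimal sketch of governance as code: a license allow-list expressed as an
# executable rule applied inside the data pipeline, rather than as a policy PDF.
# The license identifiers and records are illustrative.
ALLOWED_LICENSES = {"CC0-1.0", "CC-BY-4.0", "OGL-UK-3.0"}

def license_compatible(record: dict) -> bool:
    """Return True only if the record carries an explicitly allowed license."""
    return record.get("license") in ALLOWED_LICENSES

corpus = [
    {"id": "doc-1", "license": "CC-BY-4.0", "text": "..."},
    {"id": "doc-2", "license": "CC-BY-NC-4.0", "text": "..."},  # rejected: non-commercial
    {"id": "doc-3", "license": None, "text": "..."},            # rejected: unknown license
]

training_set = [r for r in corpus if license_compatible(r)]
print([r["id"] for r in training_set])  # -> ['doc-1']
```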

So what does this mean for the future of open data?

Open data’s potential for AI remains constrained by practical readiness gaps. The standards and practices necessary for AI-readiness are emerging, and the leading datasets show that meeting them is achievable without huge investment. The barrier is organisational: implementing best practices requires deliberate choice and sustained effort, which can prove difficult for organisations without a clear AI strategy.

AI-ready data work cannot be an afterthought for data publishers. Just as data preparation tends to account for a significant portion of any data science project, getting data into a shape and format that is useful for AI tasks takes real time and effort. For data publishers, investing in AI-readiness can lead to tangible returns. Datasets meeting these standards attract researcher attention, with a recent study showing a correlation between complete documentation and high download rates. AI-ready data also enables more sophisticated applications, and extends the impact of data to important problems, from disease surveillance to supply chain transparency to climate modelling.

Furthermore, the principles underlying AI-readiness improve data quality for all users, not just AI practitioners. A dataset designed to be AI-ready, with explicit standards, transparent metadata, version control, and uncertainty quantification, is by definition more discoverable, more interoperable, and more usable. In an era where interactions with data will more often than not be intermediated by AI, ensuring the data is AI-ready will be critical, whether that intermediation happens through AI agents, generative AI search, or AI models that help people access and use data they might otherwise lack the skills to work with.

If you’d like to learn more about our work on open data, the framework for AI-ready data, or our data-centric AI work more broadly, please do get in touch at [email protected].