At the Open Data Institute (ODI), we believe that the foundation of trustworthy AI lies in AI-ready data. Croissant has already been adopted by major platforms such as Hugging Face, Kaggle, and OpenML, so we are thrilled to see new adopters such as Common Crawl, Encord, and HumanSignal join the ecosystem with this latest release.
The industry often focuses on model properties. Yet, the real challenge for most organisations’ AI capabilities is ensuring that the data feeding these models (pre- or post-training) is high-quality and responsibly sourced. To meet these challenges, we've been working with our partners on socio-technical solutions to operationalise responsible data-centric AI. This includes tools to publish, access, and assure data for AI use cases, such as our recently released Croissant plugin for CKAN.
Croissant 1.0 established a standardised, machine-readable structure for dataset metadata. Today, we are proud to announce the release of Croissant 1.1, the latest evolution of the MLCommons metadata standard. This new release adds a range of features, including machine-actionable provenance for complete data lineage, vocabulary interoperability to link metadata to domain-specific ontologies, structured usage policies for automated enforcement of consent and licensing, and enhanced data modelling for complex datasets. As co-authors of this update, we at the ODI are excited to see how Croissant 1.1 directly addresses the core pillars of our AI-ready data framework.
A tool for AI-readiness
The AI-ready data framework provides actionable criteria for data publishers to better prepare their data for AI. Our research defines AI-readiness across four core pillars: Dataset properties, Metadata, Surrounding infrastructure, and Governance. Croissant 1.1 supports this work directly by helping data publishers meet many of these requirements:
- Machine-actionable provenance: Croissant 1.1 enables a complete chain-of-custody by adopting the W3C PROV-O model. Users can trace a dataset’s lineage through entities, activities, and agents, making audits for data-centric AI a reality
- Structured governance: The standard now integrates the Data Use Ontology (DUO) and Open Digital Rights Language (ODRL), allowing datasets to carry machine-readable usage permissions (eg, "non-commercial use") directly in the metadata
- Semantic interoperability: New flexible vocabularies allow metadata to link to domain-specific ontologies (like Wikidata), ensuring that a dataset’s meaning is preserved across different platforms
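As a minimal sketch of what machine-actionable provenance makes possible, the Python below follows a chain of `prov:wasDerivedFrom` links from a dataset back to its original source. The `prov:` terms come from W3C PROV-O; the `ex:` identifiers, the `RECORDS` layout, and the `lineage` helper are hypothetical illustrations and are not taken from the Croissant 1.1 specification.

```python
# Illustrative JSON-LD-style records keyed by identifier.
# The prov: terms are W3C PROV-O; the ex: identifiers and the way the
# records are stored are assumptions made for this sketch.
RECORDS = {
    "ex:cleaned-v2": {
        "@type": "prov:Entity",
        "prov:wasDerivedFrom": "ex:raw-v1",
        "prov:wasGeneratedBy": "ex:dedup-run",
    },
    "ex:raw-v1": {
        "@type": "prov:Entity",
        "prov:wasAttributedTo": "ex:publisher",
    },
    "ex:dedup-run": {
        "@type": "prov:Activity",
        "prov:wasAssociatedWith": "ex:publisher",
    },
    "ex:publisher": {"@type": "prov:Agent"},
}


def lineage(records, start):
    """Return the identifiers reached by following prov:wasDerivedFrom from `start`."""
    chain = []
    seen = set()  # guards against cycles in malformed metadata
    current = start
    while current is not None and current not in seen:
        seen.add(current)
        chain.append(current)
        current = records.get(current, {}).get("prov:wasDerivedFrom")
    return chain


print(lineage(RECORDS, "ex:cleaned-v2"))  # ['ex:cleaned-v2', 'ex:raw-v1']
```

An auditor could extend the same walk to collect the activities and agents attached to each entity along the chain.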
Croissant in action: volunteering and social action
The impact of this standard runs through many of our projects. For instance, our current work with the Department for Culture, Media and Sport (DCMS) involves developing an open data infrastructure for volunteering in the United Kingdom (UK). Here, we have identified agentic search for volunteering opportunities as a priority use case for our pilots.
In turn, we have had to explore how our volunteering and social action ontology could be made AI-ready. For this reason, we are applying Croissant in this pilot, ensuring that volunteering opportunities are machine-readable, discoverable and interpretable by autonomous AI agents. This allows for intelligent matching that respects the provenance and permissions of sensitive community data.
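To illustrate permission-aware matching, here is a hedged Python sketch of how an agent might filter records by their usage policy before using them. The `odrl:hasPolicy`, `odrl:permission`, `odrl:prohibition`, and `odrl:action` keys are ODRL terms; the `ex:` action names, the sample opportunities, and the `is_permitted` helper are hypothetical and do not describe the pilot's actual implementation.

```python
# Hypothetical opportunity records with simplified ODRL-style policies.
# The ex: action names and the record layout are assumptions for this sketch.
OPPORTUNITIES = [
    {
        "name": "Food bank driver",
        "odrl:hasPolicy": {
            "odrl:permission": [{"odrl:action": "ex:volunteer-matching"}],
            "odrl:prohibition": [{"odrl:action": "ex:commercial-resale"}],
        },
    },
    {
        "name": "Youth mentor",
        "odrl:hasPolicy": {
            "odrl:permission": [{"odrl:action": "ex:research"}],
        },
    },
    {"name": "Park clean-up"},  # no policy attached
]


def is_permitted(record, action):
    """Allow `action` only if it is explicitly permitted and not prohibited."""
    policy = record.get("odrl:hasPolicy", {})
    permitted = {rule["odrl:action"] for rule in policy.get("odrl:permission", [])}
    prohibited = {rule["odrl:action"] for rule in policy.get("odrl:prohibition", [])}
    return action in permitted and action not in prohibited


matchable = [o["name"] for o in OPPORTUNITIES if is_permitted(o, "ex:volunteer-matching")]
print(matchable)  # ['Food bank driver']
```

The sketch denies by default: a record with no policy is excluded, which seems the safer choice for sensitive community data.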
Bridging the gap with CKAN
To make these standards accessible, the ODI has worked to integrate Croissant support directly into CKAN, the world’s leading open-source data portal platform. Through the ckanext-dcat extension, CKAN instances can now natively expose ML dataset metadata in the Croissant format. This means:
- Automatic metadata generation: Site datasets are mapped to schema.org and Croissant resources
- Field-level detail: Resources in the CKAN DataStore can expose Croissant RecordSet objects, detailing column names and types for immediate ML use
- Embedded provenance: Leveraging the PROV-O model, CKAN portals can now expose the complete chain of custody and lineage of an ML dataset, which is essential for trust and auditing
- Automated governance: By embedding DUO and ODRL terms directly in the portal's metadata, CKAN instances allow AI agents to automatically verify whether a dataset is legally and ethically permitted for their specific use case
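The field-level mapping can be pictured with a short Python sketch that turns a DataStore-style column list into a Croissant `RecordSet`. The `cr:RecordSet`, `cr:Field`, `field`, and `dataType` names follow the Croissant format; the `TYPE_MAP` table, the `to_record_set` helper, and the sample columns are assumptions for illustration and do not reproduce the mapping used by ckanext-dcat.

```python
import json

# Hypothetical mapping from DataStore column types to schema.org data types.
# The real extension may map types differently.
TYPE_MAP = {
    "text": "sc:Text",
    "int": "sc:Integer",
    "int4": "sc:Integer",
    "float8": "sc:Float",
    "numeric": "sc:Float",
    "bool": "sc:Boolean",
    "date": "sc:Date",
}


def to_record_set(resource_id, fields):
    """Build a Croissant-style RecordSet dict from DataStore field definitions."""
    return {
        "@type": "cr:RecordSet",
        "@id": resource_id,
        "field": [
            {
                "@type": "cr:Field",
                "@id": f"{resource_id}/{column['id']}",
                "name": column["id"],
                # Unknown types fall back to text rather than failing.
                "dataType": TYPE_MAP.get(column["type"], "sc:Text"),
            }
            for column in fields
            if column["id"] != "_id"  # skip the DataStore's internal row id
        ],
    }


# Sample column definitions, shaped like a DataStore `fields` list.
columns = [
    {"id": "_id", "type": "int"},
    {"id": "title", "type": "text"},
    {"id": "hours_per_week", "type": "numeric"},
]
record_set = to_record_set("opportunities", columns)
print(json.dumps(record_set, indent=2))
```

Falling back to `sc:Text` for unrecognised types keeps the output valid at the cost of precision; a stricter publisher might prefer to raise an error instead.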
Looking ahead
As AI systems move toward autonomous agents, self-describing metadata is a prerequisite for trust. We are actively collaborating with Encord in the Responsible AI (RAI) working group, with more details to be shared in an upcoming press release from MLCommons. With over 800,000 datasets already carrying Croissant metadata on Hugging Face alone, alongside massive adoption on platforms like Kaggle, the standard is becoming the bedrock of the AI ecosystem. We encourage all data publishers to adopt Croissant 1.1 to ensure their data is not just findable, but truly AI-ready.