MLCommons has announced the release of Croissant, a metadata format to help standardise the documentation of machine learning (ML) datasets. Croissant is set to make a huge difference to data practices in AI as practitioners adopt it to describe their datasets and more AI platforms support Croissant-annotated datasets. This promises to be a game changer in AI safety and ethics, where high-quality, well-documented datasets are essential.

Currently, many ML datasets lack sufficient machine-readable documentation to allow people to use them responsibly. Without this information, finding, understanding, and using these datasets safely and ethically can be very time-consuming.

Croissant aims to make data more easily accessible and discoverable. It enables datasets to be loaded into different AI platforms without the need for reformatting. Users looking to publish a dataset in the Croissant format can use the ‘Croissant editor’ to inspect, create, or modify Croissant descriptions for their datasets. For programmatic support, there is also the mlcroissant Python library.
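
To give a rough sense of what a Croissant description contains, the sketch below builds a simplified, Croissant-style JSON-LD document using only the Python standard library. It is an illustration under stated assumptions, not a complete or validated Croissant file: the helper names (build_description, field_names), the example.org URL, and the column names are hypothetical, every column is treated as text, and the exact properties and context entries should be checked against the Croissant specification. In practice, the Croissant editor or the Python library would produce and validate these descriptions.

```python
import json


def build_description(name, description, csv_url, columns):
    """Return a simplified Croissant-style JSON-LD description as a dict.

    Assumption: one CSV file, one record set, every column treated as text.
    """
    file_id = "data.csv"  # hypothetical identifier for the single file
    fields = [
        {
            "@type": "cr:Field",
            "@id": f"rows/{column}",
            "name": column,
            "dataType": "sc:Text",
            # Each field points back to the file and column it is read from.
            "source": {
                "fileObject": {"@id": file_id},
                "extract": {"column": column},
            },
        }
        for column in columns
    ]
    return {
        # Simplified context; a real Croissant file declares more terms.
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
        },
        "@type": "sc:Dataset",
        "name": name,
        "description": description,
        # Where the data lives and how it is encoded.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": file_id,
                "contentUrl": csv_url,
                "encodingFormat": "text/csv",
            }
        ],
        # How the file content maps onto records with named fields.
        "recordSet": [
            {"@type": "cr:RecordSet", "@id": "rows", "field": fields}
        ],
    }


def field_names(desc):
    """List the field names declared across all record sets."""
    return [f["name"] for rs in desc["recordSet"] for f in rs["field"]]


if __name__ == "__main__":
    desc = build_description(
        "toy-reviews",
        "A toy dataset of short reviews.",
        "https://example.org/data.csv",  # placeholder URL
        ["text", "label"],
    )
    print(json.dumps(desc, indent=2))
    print(field_names(desc))
```

The point of the structure is that a platform can read the distribution entry to find the data and the recordSet entry to interpret it, which is what allows a dataset to be loaded without reformatting.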

The ODI has been an early supporter of the initiative, with our Director of Research Prof Elena Simperl co-chairing the Croissant working group. Moving forward, the ODI will help to advance Croissant in several ways, including piloting and evaluating the standard on key ML datasets, and promoting Croissant to the wider AI/ML community, in particular in the UK and Europe.

The ODI has an extensive track record of designing, evaluating, and promoting open data standards in multiple domains, including the UK Open Banking standard, the OpenActive standard, and the Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ) Data4Policy. Open standards and interoperable data infrastructure are at the core of the 15-point plan for our data-centric AI programme. Building on our work on data infrastructure, data stewardship and governance, we look forward to building a global community and fostering the adoption of Croissant.

Data is a critical element of any model's performance, and as some experts suggest, it will run out, making the need to harness it even more important. Croissant allows more people to do more with data. As co-chair of the working group, it is a privilege to collaborate with world-class machine learning scientists and engineers around the globe, making an enormous contribution to the AI data ecosystem.
Prof Elena Simperl
Director of Research at the ODI, Professor of Computer Science at King’s College London and co-chair of the Croissant working group

Croissant is made possible thanks to efforts by the MLCommons Croissant working group, which includes contributors from these organisations: Bayer, cTuning Foundation, DANS-KNAW, Dotphoton, Google, Harvard, Hugging Face, Kaggle, King's College London, the ODI, Meta, NASA, Open University of Catalonia - Luxembourg Institute of Science and Technology, and TU Eindhoven.

You can join the Croissant Working Group, contribute to the GitHub repository, and download the Croissant Editor to apply the Croissant vocabulary to your existing datasets.