Croissant

Tue Nov 26, 2024

Many machine-learning (ML) datasets lack sufficient machine-readable documentation to allow people to use them responsibly. Without this information, finding, understanding, and using these datasets safely and ethically can be very time-consuming.

In March 2024, MLCommons announced the release of Croissant, a metadata format that helps standardise the documentation of ML datasets. Croissant is set to make a significant difference to data practices in AI as practitioners adopt it to describe their datasets and more AI platforms support Croissant-annotated datasets. This matters most in AI safety and ethics, where high-quality, well-documented datasets are essential.

Croissant aims to make data more easily accessible and discoverable. It enables datasets to be loaded into different AI platforms without the need for reformatting. Users who want to publish a dataset in the Croissant format can use the ‘Croissant editor’ to inspect, create, or modify Croissant descriptions for their datasets. The mlcroissant Python library provides programmatic support.
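To give a rough sense of what a Croissant description contains, the sketch below builds a simplified JSON-LD record using only the Python standard library. It is an illustration under stated assumptions, not a complete or validated Croissant file: the dataset name, URLs, and the helper build_croissant_description are hypothetical, and real descriptions carry a longer context and further properties, such as a conformance declaration and file checksums.

```python
import json

# Hypothetical dataset details, used only for illustration.
DATASET_NAME = "example-reviews"
DATA_URL = "https://example.org/data/reviews.csv"


def build_croissant_description(name: str, data_url: str) -> dict:
    """Return a simplified Croissant-style JSON-LD description as a dict.

    The structure follows the general shape of the format (dataset-level
    metadata, file resources, and record sets with fields), but it is
    abbreviated and not guaranteed to pass Croissant validation.
    """
    return {
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
        },
        # Dataset-level metadata, based on the schema.org Dataset vocabulary.
        "@type": "sc:Dataset",
        "name": name,
        "description": "A small illustrative dataset of text reviews.",
        "license": "https://creativecommons.org/licenses/by/4.0/",
        "url": "https://example.org/datasets/" + name,
        # Resources: the files that make up the dataset.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": "reviews.csv",
                "name": "reviews.csv",
                "contentUrl": data_url,
                "encodingFormat": "text/csv",
            }
        ],
        # Structure: how records and fields map onto the files above.
        "recordSet": [
            {
                "@type": "cr:RecordSet",
                "@id": "reviews",
                "name": "reviews",
                "field": [
                    {
                        "@type": "cr:Field",
                        "@id": "reviews/text",
                        "name": "text",
                        "dataType": "sc:Text",
                        "source": {
                            "fileObject": {"@id": "reviews.csv"},
                            "extract": {"column": "text"},
                        },
                    }
                ],
            }
        ],
    }


description = build_croissant_description(DATASET_NAME, DATA_URL)
croissant_json = json.dumps(description, indent=2)
print(croissant_json)
```

Because the description is ordinary JSON-LD, a platform can read the same file to find where the data lives and how its fields are laid out. In practice, the Croissant editor or the mlcroissant library would be the more reliable way to create and check such a file.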

Croissant is made possible thanks to efforts by the MLCommons Croissant working group, which includes contributors from these organisations: Bayer, cTuning Foundation, DANS-KNAW, Dotphoton, Google, Harvard, Hugging Face, Kaggle, King's College London, the ODI, Meta, NASA, Open University of Catalonia - Luxembourg Institute of Science and Technology, and TU Eindhoven.

An introduction to Croissant

Watch the video at https://youtu.be/qToOlmmtjlI?feature=shared

Croissant: A Metadata Format for ML-Ready Datasets

Workshop paper - Croissant: A Metadata Format for ML-Ready Datasets

Croissant poster

Related

Transforming AI data governance with Croissant: a new standard for ML metadata

The ODI to help develop an open metadata standard for machine learning data

Policy intervention 1: Increase transparency around the data used to train AI models

The ODI's vision for data-centric AI

A data for AI taxonomy

AI data transparency: understanding the needs and current state of play