Many machine-learning (ML) datasets lack sufficient machine-readable documentation to allow people to use them responsibly. Without this information, finding, understanding, and using these datasets safely and ethically can be very time-consuming.
In March 2024, MLCommons announced the release of Croissant, a metadata format to help standardise the documentation of ML datasets. As AI practitioners adopt it to describe their datasets and more AI platforms support Croissant-annotated datasets, Croissant is set to make a substantial difference to data practices in AI. This matters particularly for AI safety and ethics, where high-quality, well-documented datasets are essential.
Croissant aims to make datasets easier to find and access. It enables datasets to be loaded into different AI platforms without reformatting. Users who want to publish a dataset in the Croissant format can use the ‘Croissant editor’ to inspect, create, or modify Croissant descriptions for their datasets. The mlcroissant Python library provides programmatic support.
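To give a feel for what a Croissant description contains, here is a minimal sketch that assembles a simplified, Croissant-style JSON-LD document using only the Python standard library. The helper name `build_croissant_description`, the dataset name, and the example.org URL are hypothetical. The property names (`distribution`, `recordSet`, `field`, `source`) reflect our reading of the Croissant 1.0 format and should be checked against the specification. The sketch omits the `@context` block and other properties, such as the licence, so it is an illustration and not a valid Croissant file.

```python
import json


def build_croissant_description(name, description, csv_url, columns):
    """Return a simplified, Croissant-style JSON-LD description as a dict.

    `columns` maps each CSV column name to a data type such as "sc:Text".
    This is an illustrative sketch, not a spec-complete Croissant file.
    """
    file_id = "data.csv"  # hypothetical identifier for the single CSV file
    return {
        "@type": "sc:Dataset",
        "conformsTo": "http://mlcommons.org/croissant/1.0",
        "name": name,
        "description": description,
        # Resources: the files that make up the dataset.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": file_id,
                "contentUrl": csv_url,
                "encodingFormat": "text/csv",
            }
        ],
        # Structure: how records and their fields map onto the files.
        "recordSet": [
            {
                "@type": "cr:RecordSet",
                "name": "default",
                "field": [
                    {
                        "@type": "cr:Field",
                        "name": column,
                        "dataType": data_type,
                        "source": {
                            "fileObject": {"@id": file_id},
                            "extract": {"column": column},
                        },
                    }
                    for column, data_type in columns.items()
                ],
            }
        ],
    }


description = build_croissant_description(
    name="toy_reviews",
    description="Hypothetical example dataset of short product reviews.",
    csv_url="https://example.org/toy_reviews.csv",
    columns={"text": "sc:Text", "rating": "sc:Integer"},
)
print(json.dumps(description, indent=2))
```

In practice, descriptions are better created and validated with the Croissant editor or the mlcroissant library than written by hand.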
Croissant is made possible thanks to efforts by the MLCommons Croissant working group, which includes contributors from these organisations: Bayer, cTuning Foundation, DANS-KNAW, Dotphoton, Google, Harvard, Hugging Face, Kaggle, King's College London, the ODI, Meta, NASA, Open University of Catalonia - Luxembourg Institute of Science and Technology, and TU Eindhoven.
Croissant: A Metadata Format for ML-Ready Datasets
Workshop paper - Croissant: A Metadata Format for ML-Ready Datasets