The MLCommons initiative has unveiled Croissant, a metadata format designed to simplify how machine learning (ML) practitioners work with datasets.

ML development faces a basic practical challenge: data comes in disparate representations, such as text, structured data, images, audio, and video, each with its own arrangements and formats.

While existing metadata formats like schema.org and DCAT cater to general datasets, they fall short of meeting the specific needs of ML practitioners.

Croissant, a collaborative effort within the MLCommons initiative, offers a standardised method to describe and organise ML-ready datasets.

Building upon the foundation of schema.org, Croissant introduces layers for ML-specific metadata, data resources, organisation, and default ML semantics.
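As a rough illustration of that layering, the Python sketch below assembles a simplified, Croissant-style JSON-LD description of a single CSV file. The helper name, the example URL, and the column names are hypothetical, and the `@context` is heavily abbreviated, so the output is an approximation of the format's shape and should not be expected to pass validation against the 1.0 specification.

```python
import json


def build_croissant_like_metadata(name, description, csv_url, columns):
    """Build a simplified, Croissant-style description of one CSV file.

    `columns` maps column names to schema.org data types (e.g. "sc:Text").
    Illustrative only: the real specification requires more properties.
    """
    file_id = "data.csv"  # hypothetical identifier for the single resource
    fields = [
        {
            "@type": "cr:Field",
            "name": column,
            "dataType": data_type,
            # Tie each field back to a column of the file declared below.
            "source": {
                "fileObject": {"@id": file_id},
                "extract": {"column": column},
            },
        }
        for column, data_type in columns.items()
    ]
    return {
        # Abbreviated context (assumption): schema.org plus a Croissant prefix.
        "@context": {
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
        },
        # General dataset metadata, inherited from schema.org.
        "@type": "sc:Dataset",
        "name": name,
        "description": description,
        # Resource layer: the files that make up the dataset.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": file_id,
                "contentUrl": csv_url,
                "encodingFormat": "text/csv",
            }
        ],
        # Organisation layer: how raw files map to records and fields.
        "recordSet": [
            {"@type": "cr:RecordSet", "name": "default", "field": fields}
        ],
    }


metadata = build_croissant_like_metadata(
    name="toy_reviews",
    description="A toy dataset of review texts and ratings.",
    csv_url="https://example.org/toy_reviews.csv",  # placeholder URL
    columns={"text": "sc:Text", "rating": "sc:Integer"},
)
print(json.dumps(metadata, indent=2))
```

The point of the sketch is the separation of concerns: general metadata at the top level, file resources under `distribution`, and the record structure under `recordSet`.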

Major ML platforms, including Kaggle, Hugging Face, and OpenML, along with frameworks like TensorFlow, PyTorch, and JAX, have announced their support for the Croissant format.

The 1.0 release of Croissant includes a comprehensive specification, example datasets, an open-source Python library for validation and generation of Croissant metadata, and a user-friendly visual editor for creating intuitive dataset descriptions.

In the realm of ML, where the majority of work revolves around data, the absence of a common format imposes a substantial data development burden.

Croissant aims to alleviate this burden by streamlining the ML development process, facilitating dataset discoverability, simplifying data cleaning and analysis, and enabling model training with minimal code.

Croissant datasets are already available on prominent platforms like Google Dataset Search, Hugging Face, Kaggle, and OpenML.

Integration with TensorFlow Datasets handles data ingestion, while the Croissant editor UI lets users inspect and modify metadata.

To publish a Croissant dataset, creators can generate the metadata automatically with the editor UI, then either publish it on their dataset webpage or host the data in a repository that supports Croissant.
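For the webpage route, one common way to publish schema.org-based metadata is to embed the JSON-LD in a `<script type="application/ld+json">` element on the dataset page. The sketch below assumes that convention applies here; the function name and the example metadata are hypothetical, and the metadata shown is far from a complete Croissant description.

```python
import json


def to_jsonld_script_tag(metadata):
    """Serialise metadata as JSON-LD wrapped in an HTML script element."""
    payload = json.dumps(metadata, indent=2)
    # Escape "</" so a string value cannot close the script element early.
    # "\/" is a valid JSON escape for "/", so the payload stays parseable.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Placeholder metadata for illustration only.
example_metadata = {
    "@context": {"sc": "https://schema.org/"},
    "@type": "sc:Dataset",
    "name": "toy_reviews",
    "url": "https://example.org/toy_reviews",
}

tag = to_jsonld_script_tag(example_metadata)
print(tag)
```

The resulting string can be placed in the page's HTML, where crawlers that understand JSON-LD can read it.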