MLCommons launches collaborative metadata for machine learning datasets

MLCommons is an AI engineering consortium which promotes open collaboration to improve AI systems.

The MLCommons initiative has unveiled Croissant, a metadata format designed to simplify how machine learning (ML) practitioners find and work with datasets.

The challenges in ML development are manifold. Datasets come in disparate representations such as text, structured data, images, audio, and video, each with its own arrangements and formats.

While existing metadata formats like schema.org and DCAT cater to general datasets, they fall short of meeting the specific needs of ML practitioners.

Croissant, a collaborative work within the MLCommons initiative, offers a standardised method to describe and organise ML-ready datasets.

Building upon the foundation of schema.org, Croissant introduces layers for ML-specific metadata, data resources, organisation, and default ML semantics.
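To make the layering more concrete, here is a minimal sketch of what a Croissant-style description might look like, written as a Python dictionary and serialised to JSON-LD. It is an illustration under assumptions, not an excerpt from the specification: the dataset name, URLs, and column names are hypothetical, and the property names (`cr:recordSet`, `cr:field`, `cr:source` and so on) are simplified, so the official specification remains the authority on exact spelling and required properties. The default ML semantics layer is not shown.

```python
import json


def build_croissant_sketch():
    """Build a simplified, Croissant-style description of a hypothetical CSV dataset."""
    return {
        # schema.org supplies the base vocabulary; "cr" marks the ML-specific additions.
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
        },
        # Dataset-level metadata (all values are placeholders).
        "@type": "sc:Dataset",
        "name": "toy_reviews",
        "description": "Hypothetical product reviews with sentiment labels.",
        "license": "https://creativecommons.org/licenses/by/4.0/",
        "url": "https://example.org/toy_reviews",
        # Data resources: the files that make up the dataset.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": "reviews.csv",
                "contentUrl": "https://example.org/toy_reviews/reviews.csv",
                "encodingFormat": "text/csv",
            }
        ],
        # Organisation: how raw files map onto typed records and fields.
        "cr:recordSet": [
            {
                "@type": "cr:RecordSet",
                "@id": "reviews",
                "cr:field": [
                    {
                        "@type": "cr:Field",
                        "@id": "reviews/text",
                        "cr:dataType": "sc:Text",
                        "cr:source": {
                            "cr:fileObject": {"@id": "reviews.csv"},
                            "cr:extract": {"cr:column": "text"},
                        },
                    },
                    {
                        "@type": "cr:Field",
                        "@id": "reviews/label",
                        "cr:dataType": "sc:Integer",
                        "cr:source": {
                            "cr:fileObject": {"@id": "reviews.csv"},
                            "cr:extract": {"cr:column": "label"},
                        },
                    },
                ],
            }
        ],
    }


if __name__ == "__main__":
    # Serialise the description so it could be saved alongside the data.
    print(json.dumps(build_croissant_sketch(), indent=2))
```

The point of the sketch is the separation of concerns: general descriptive properties come from schema.org, while the file and record-level structure is what the ML-specific vocabulary adds on top.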

Major ML platforms, including Kaggle, Hugging Face, and OpenML, along with frameworks like TensorFlow, PyTorch, and JAX, have announced their support for the Croissant format.

The 1.0 release of Croissant includes a comprehensive specification, example datasets, an open-source Python library for validation and generation of Croissant metadata, and a user-friendly visual editor for creating intuitive dataset descriptions.
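As a rough idea of what validating such metadata involves, the sketch below runs two toy consistency checks on a Croissant-style dictionary. It is not the API of the official Python library: `check_metadata` is a hypothetical helper, the simplified key names (`cr:recordSet`, `cr:field`, `cr:source`, `cr:fileObject`) are assumptions, and the real validator covers far more than this.

```python
def check_metadata(metadata):
    """Return a list of problems found; an empty list means the toy checks passed."""
    problems = []

    # Check 1: a few top-level keys this sketch assumes are always present.
    for key in ("name", "distribution", "cr:recordSet"):
        if key not in metadata:
            problems.append(f"missing key: {key}")

    # Check 2: every field must point at a file declared in "distribution".
    file_ids = {res.get("@id") for res in metadata.get("distribution", [])}
    for record_set in metadata.get("cr:recordSet", []):
        for field in record_set.get("cr:field", []):
            source = field.get("cr:source", {})
            ref = source.get("cr:fileObject", {}).get("@id")
            if ref not in file_ids:
                problems.append(
                    f"field {field.get('@id')} refers to unknown file {ref!r}"
                )
    return problems


# Hypothetical metadata: one CSV file and one field extracted from it.
good = {
    "name": "toy_reviews",
    "distribution": [{"@id": "reviews.csv", "encodingFormat": "text/csv"}],
    "cr:recordSet": [
        {
            "@id": "reviews",
            "cr:field": [
                {
                    "@id": "reviews/text",
                    "cr:source": {"cr:fileObject": {"@id": "reviews.csv"}},
                }
            ],
        }
    ],
}

# Same record set, but the file it depends on is no longer declared.
broken = {
    "name": "toy_reviews",
    "distribution": [],
    "cr:recordSet": good["cr:recordSet"],
}

if __name__ == "__main__":
    print(check_metadata(good))    # no problems reported
    print(check_metadata(broken))  # one dangling file reference reported
```

Checks of this kind are what make a shared format useful to tooling: a loader can refuse or flag a dataset description before any data is downloaded.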

In the realm of ML, where the majority of work revolves around data, the absence of a common format imposes a substantial data development burden.

Croissant aims to alleviate this burden by streamlining the ML development process, facilitating dataset discoverability, simplifying data cleaning and analysis, and enabling model training with minimal code.

Croissant datasets are already available on prominent platforms like Google Dataset Search, Hugging Face, Kaggle, and OpenML.

Integration with TensorFlow Datasets allows for data ingestion, while the Croissant editor UI enables users to inspect and modify metadata.

To publish a Croissant dataset, creators can use the editor UI to generate metadata automatically, publish it on their dataset webpage, or leverage supported repositories.
