Template for datasheet for dataset

This repository provides a template for a datasheet for a dataset.

The template is in datasheet-for-dataset-template.md.

What is a datasheet for a dataset?

Datasheets for datasets were created to increase transparency of datasets.

[Datasheets for datasets] document [the dataset] motivation, composition, collection process, recommended uses, and so on. [They] have the potential to increase transparency and accountability within the machine learning community, mitigate unwanted biases in machine learning systems, facilitate greater reproducibility of machine learning results, and help researchers and practitioners select more appropriate datasets for their chosen tasks.

The problem they are trying to solve:

Despite the importance of data to machine learning, there is no standardized process for documenting machine learning datasets. To address this gap, we propose datasheets for datasets.

The datasheet is not a passive, after-the-fact document. Dataset creators are expected to read the questions in the motivation, composition, and collection process sections before they start collecting data for the dataset. These sections raise considerations that cannot be easily rectified later if they are not taken into account before the data is gathered. Similarly, dataset creators are expected to read the questions in the preprocessing/cleaning/labeling section before they preprocess the raw data.

Why use a markdown file for the datasheet?

The short explanation: using a markdown file allows us to easily compare (diff) one version of the datasheet with another.

The longer explanation:

Datasets should be under version control, just as code is. Once a dataset is under version control, we can compare one version against another.

It is easier to follow the changes in a dataset when its datasheet is distributed together with the dataset. If the dataset is under source control, its datasheet should be too.

Whenever there is a new version of the dataset, we also need to update its description. In other words, we need to update its datasheet.

The datasheet distributed with a version should be in a format that is easy to compare with previous versions, so that we can quickly see what has changed. Markdown is a simple text format, which makes it ideal for that.
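As a rough illustration of why a plain-text format helps, the sketch below compares two versions of a datasheet using Python's standard difflib module. The function name diff_datasheets, the labels, and the sample datasheet text are hypothetical and are not part of the template. In practice, the version control system's own diff would usually do this job.

```python
import difflib


def diff_datasheets(old_text: str, new_text: str,
                    old_label: str = "datasheet v1",
                    new_label: str = "datasheet v2") -> str:
    """Return a unified diff between two versions of a datasheet.

    The labels are only used in the diff header; they are illustrative.
    """
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(old_lines, new_lines,
                             fromfile=old_label, tofile=new_label)
    )


# Hypothetical excerpts from two versions of the same datasheet.
old = "## Composition\n\nThe dataset has 1,000 instances.\n"
new = "## Composition\n\nThe dataset has 1,200 instances.\n"

diff = diff_datasheets(old, new)
print(diff)
```

Lines removed from the older version are prefixed with "-" and lines added in the newer version with "+", so a change in the dataset's description is visible at a glance.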

Examples of dataset datasheets

CheXpert
Movie review polarity, a supplement to the publication of the Datasheets for Datasets paper in Communications of the ACM (the paper on arXiv).

Google has been using data cards to document its datasets. A data card is close to, but not the same as, the datasheet for dataset template. In the words of the Data Cards paper: "Data Cards complement other longer-form and domain-specific documentation frameworks for ethical reporting, such as Model Cards [22], Data Statements [8], and Datasheets for Datasets [14]." For example, this is Google's Open Images Extended - MIAP (paper) data card.

Model cards

If you are interested in datasheets for datasets, you may also want to review model cards.
