Datasets is an open source project, so all contributions and suggestions are welcome.
You can contribute in many different ways: giving ideas, answering questions, reporting bugs, proposing enhancements, improving the documentation, fixing bugs, and more.
Many thanks in advance to every contributor.
In order to facilitate healthy, constructive behavior in an open and inclusive community, we all respect and abide by our code of conduct.
The list of open Issues is at: https://github.com/huggingface/datasets/issues
Some of them may have the label "help wanted": that means that any contributor is welcome to work on them!
If you would like to work on any of the open Issues:
- Make sure it is not already assigned to someone else. The assignee (if any) is shown at the top of the right column of the Issue page.
- You can self-assign it by commenting on the Issue page with the keyword #self-assign.
- Work on your self-assigned issue and eventually create a Pull Request.
If you want to add a dataset, see the specific instructions in the section How to add a dataset.
- Fork the repository by clicking on the 'Fork' button on the repository's page. This creates a copy of the code under your GitHub user account.
- Clone your fork to your local disk, and add the base repository as a remote:
    git clone git@github.com:<your Github handle>/datasets.git
    cd datasets
    git remote add upstream https://github.com/huggingface/datasets.git
- Create a new branch to hold your development changes:
    git checkout -b a-descriptive-name-for-my-changes
  Do not work on the main branch.
- Set up a development environment by running one of the following commands in a virtual environment:
  Simple setup with code formatting only (recommended):
    pip install -e ".[quality]"
  Advanced setup with all the optional dependencies:
    pip install -e ".[dev]"
  (If datasets was already installed in the virtual environment, remove it with pip uninstall datasets before reinstalling it in editable mode with the -e flag.)
- Develop the features on your branch.
- Format your code. Run black and ruff so that your newly added files look nice with the following command:
    make style
- (Optional) You can also use pre-commit to format your code automatically each time you run git commit, instead of running make style manually. To do this, install pre-commit via pip install pre-commit and then run pre-commit install in the project's root directory to set up the hooks. Note that if any files were formatted by pre-commit hooks during committing, you have to run git commit again.
- Once you're happy with your contribution, add your changed files and make a commit to record your changes locally:
    git add -u
    git commit
  It is a good idea to sync your copy of the code with the original repository regularly. This way you can quickly account for changes:
    git fetch upstream
    git rebase upstream/main
- Once you are satisfied, push the changes to your fork repo using:
    git push -u origin a-descriptive-name-for-my-changes
  Go to the webpage of your fork on GitHub. Click on "Pull request" to send your changes to the project maintainers for review.
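The advice not to work on the main branch can be checked mechanically before pushing. The following is only a minimal sketch, not part of the project's tooling: the helper names `is_safe_branch` and `current_branch` and the `PROTECTED_BRANCHES` set are hypothetical, and it assumes git is available on your PATH.

```python
import subprocess

# Assumption: "main" is the only branch contributors should avoid committing to.
PROTECTED_BRANCHES = {"main"}


def is_safe_branch(branch: str) -> bool:
    """Return True if `branch` looks like a feature branch.

    An empty name, a detached HEAD (reported by git as "HEAD"), or a
    protected branch such as "main" is treated as unsafe.
    """
    name = branch.strip()
    return bool(name) and name != "HEAD" and name not in PROTECTED_BRANCHES


def current_branch() -> str:
    """Ask git for the name of the checked-out branch.

    Must be called from inside a git repository; raises
    subprocess.CalledProcessError otherwise.
    """
    result = subprocess.run(
        ["git", "rev-parse", "--abbrev-ref", "HEAD"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Calling `is_safe_branch(current_branch())` from the root of your clone before running git push gives a quick signal that your commits are on a feature branch rather than on main.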
You can share your dataset on https://huggingface.co/datasets directly using your account (no need to open a PR on GitHub); see the documentation for details.
Improving the documentation of datasets is an ever-increasing effort, and we invite users to contribute by sharing their insights with the community in the README.md dataset cards provided for each dataset.
If you see that a dataset card is missing information that you are in a position to provide (as an author of the dataset or as an experienced user), the best thing you can do is to open a Pull Request on the Hugging Face Hub. To do so, go to the "Files and versions" tab of the dataset page and edit the README.md file. We provide:
- a template
- a guide describing what information should go into each of the paragraphs
- and if you need inspiration, we recommend looking through a completed example
If you are a dataset author... you know what to do, it is your dataset after all ;) ! We would especially appreciate it if you could help us fill in information about the process of creating the dataset, and take a moment to reflect on its social impact and possible limitations if you haven't already done so in the dataset paper or in another data statement.
If you are a user of a dataset, the main source of information should be the dataset paper if it is available: we recommend pulling information from there into the relevant paragraphs of the template. We also eagerly welcome discussions on the Considerations for Using the Data based on existing scholarship or personal experience that would benefit the whole community.
Finally, if you want more information on the how and why of dataset cards, we strongly recommend reading the foundational works Datasheets for Datasets and Data Statements for NLP.
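To spot which parts of a dataset card still need content before you edit its README.md, a small script can list the expected section headings that are absent. This is a rough sketch under stated assumptions: the section titles in `EXPECTED_SECTIONS` are assumed to match the template's top-level headings (only "Considerations for Using the Data" is named in this guide) and should be checked against the current template, and `missing_sections` is a hypothetical helper, not an API of the datasets library.

```python
import re
from typing import List, Sequence

# Assumed section titles; verify them against the dataset card template.
EXPECTED_SECTIONS = (
    "Dataset Description",
    "Dataset Structure",
    "Dataset Creation",
    "Considerations for Using the Data",
    "Additional Information",
)

# Matches a markdown ATX heading ("#" to "######") and captures its title.
HEADING_PATTERN = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", flags=re.MULTILINE)


def missing_sections(
    card_text: str, expected: Sequence[str] = EXPECTED_SECTIONS
) -> List[str]:
    """Return the expected section titles that have no heading in the card."""
    headings = {match.group(1) for match in HEADING_PATTERN.finditer(card_text)}
    return [title for title in expected if title not in headings]
```

Passing the text of a README.md to `missing_sections` returns the titles to look at first. The check only detects absent headings, so a section that is present but empty still needs a manual read.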
Thank you for your contribution!
This project adheres to the HuggingFace code of conduct. By participating, you are expected to abide by this code.