This document is intended for developers who want to install, test or contribute to the code.
To start working on the project:
git clone git@github.com:huggingface/datasets-server.git
cd datasets-server
Install docker (see https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository and https://docs.docker.com/engine/install/linux-postinstall/)
make start
To install a single job (in jobs), library (in libs), service (in services) or worker (in workers), go to its directory, and install Python 3.9 (consider pyenv) and poetry (don't forget to add poetry to the PATH environment variable).
If you use pyenv:
cd libs/libcommon/
pyenv install 3.9.15
pyenv local 3.9.15
poetry env use python3.9
then:
make install
It will create a virtual environment in a ./.venv/ subdirectory.
If you use VSCode, it might be useful to use the "monorepo" workspace (see a blogpost for more explanations). It is a multi-root workspace, with one folder for each library and service (note that we hide them from the ROOT to avoid editing there). Each folder has its own Python interpreter, with access to the dependencies installed by Poetry. You might have to manually select the interpreter in every folder on first access; VSCode then stores the information in its local storage.
The repository is structured as a monorepo, with Python libraries and applications in jobs, libs, services and workers:
- jobs contains the one-time jobs run by Helm before deploying the pods. For now, the only job migrates the databases when needed.
- libs contains the Python libraries used by the services and workers. For now, the only library is libcommon, which contains the common code for the services and workers.
- services contains the applications: the public API, the admin API (which is separated from the public API and might be published under its own domain at some point) and the reverse proxy.
- workers contains the workers that process the queue asynchronously: they get a "job" (caution: not the Helm jobs, but the jobs stored in the queue), process the expected response for the associated endpoint, and store the response in the cache.
If you have access to the internal HF notion, see https://www.notion.so/huggingface2/Datasets-server-464848da2a984e999c540a4aa7f0ece5.
The application is distributed in several components.
api is a web server that exposes the API endpoints. Apart from some endpoints (valid, is-valid), all the responses are served from pre-computed responses. That's the main point of this project: generating these responses takes time, and the API server provides this service to the users.
The precomputed responses are stored in a Mongo database called "cache". They are computed by workers which take their jobs from a job queue stored in a Mongo database called "queue", and store the results (error or valid response) into the "cache" (see libcommon).
The API service exposes the /webhook endpoint which is called by the Hub on every creation, update or deletion of a dataset on the Hub. On deletion, the cached responses are deleted. On creation or update, a new job is appended in the "queue" database.
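The flow described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the project's code: the `queue` and `cache` objects stand in for the two Mongo databases, and the function names (`handle_webhook`, `run_worker_once`), the event names and the response layout are all hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Dict, Optional

# In-memory stand-ins for the "queue" and "cache" Mongo databases (assumption).
queue: Deque[str] = deque()
cache: Dict[str, dict] = {}


def handle_webhook(event: str, dataset: str) -> None:
    """Hypothetical webhook handler; the event names are illustrative."""
    if event == "deletion":
        # On deletion, the cached responses are deleted.
        cache.pop(dataset, None)
    elif event in ("creation", "update"):
        # On creation or update, a new job is appended to the queue.
        queue.append(dataset)
    else:
        raise ValueError(f"unsupported event: {event}")


def run_worker_once(compute_response: Callable[[str], dict]) -> Optional[str]:
    """Take one job from the queue and store the result in the cache.

    Both outcomes are cached: a valid response or an error.
    Returns the processed dataset name, or None if the queue was empty.
    """
    if not queue:
        return None
    dataset = queue.popleft()
    try:
        cache[dataset] = {"status": "ok", "content": compute_response(dataset)}
    except Exception as err:
        cache[dataset] = {"status": "error", "content": {"error": str(err)}}
    return dataset


def get_response(dataset: str) -> Optional[dict]:
    """What the API would serve: only pre-computed entries, never a live computation."""
    return cache.get(dataset)
```

The point of the design is that `get_response` never computes anything itself: slow work happens in the worker, and the API only reads what has already been stored.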
Note that every worker has its own job queue:
- /splits: the job is to refresh a dataset, namely to get the list of config and split names, then to create a new job for every split for the workers that depend on it.
- /first-rows: the job is to get the columns and the first 100 rows of the split.
- /parquet: the job is to download the dataset, prepare a parquet version of every split (various sharded parquet files), and upload them to the refs/convert/parquet "branch" of the dataset repository on the Hub.
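To illustrate the /splits fan-out, the sketch below creates one follow-up job per split. It is an example under stated assumptions: `Job`, `process_splits_job` and the `list_splits` callback are hypothetical names, and only the /first-rows follow-up is modeled.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Job:
    """Hypothetical job record: a /splits job targets a dataset,
    a /first-rows job targets one (config, split) pair of that dataset."""

    endpoint: str
    dataset: str
    config: Optional[str] = None
    split: Optional[str] = None


def process_splits_job(
    job: Job,
    list_splits: Callable[[str], List[Tuple[str, str]]],
) -> List[Job]:
    """Refresh a dataset: list its (config, split) pairs, then return
    one new /first-rows job for every split."""
    if job.endpoint != "/splits":
        raise ValueError(f"expected a /splits job, got {job.endpoint}")
    return [
        Job(endpoint="/first-rows", dataset=job.dataset, config=config, split=split)
        for config, split in list_splits(job.dataset)
    ]
```

A dataset with many splits produces many follow-up jobs from a single /splits job, which is consistent with running more first-rows workers than splits workers.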
Note also that the workers create local files when the dataset contains images or audio files. A shared directory (ASSETS_STORAGE_DIRECTORY) must therefore be provisioned with sufficient space for the generated files. The /first-rows endpoint responses contain URLs to these files, served by the API under the /assets/ endpoint.
Hence, the working application has:
- one instance of the API service which exposes a port
- N1 instances of the splits worker, N2 instances of the first-rows worker (N2 should generally be higher than N1), N3 instances of the parquet worker
- a Mongo server with two databases: "cache" and "queue"
- a shared directory for the assets
The application also has:
- a reverse proxy in front of the API to serve static files and proxy the rest to the API server
- an admin server to serve technical endpoints
The following environments contain all the modules: reverse proxy, API server, admin API server, workers, and the Mongo database.
Environment | URL | Type | How to deploy |
---|---|---|---|
Production | https://datasets-server.huggingface.co | Helm / Kubernetes | make upgrade-prod in chart |
Development | https://datasets-server.us.dev.moon.huggingface.tech | Helm / Kubernetes | make upgrade-dev in chart |
Local from remote images | http://localhost:8100 | Docker compose | make start-from-remote-images (fetches docker images from Docker Hub) |
Local build | http://localhost:8000 | Docker compose | make start-from-local-code (builds docker images) |
The CI checks the quality of the code through a GitHub action. To manually format the code of a job, library, service or worker:
make style
To check the quality (which includes checking the style, but also security vulnerabilities):
make quality
The CI runs the tests through a GitHub action. To manually test a job, library, service or worker:
make test
Note that it requires the resources to be ready, i.e. Mongo and the storage for assets.
To launch the end to end tests:
make e2e
We version the libraries as they are dependencies of the services. To update a library:
- change the version in its pyproject.toml file
- build with make build
- version the new files in dist/
And then update the library version in the services that require the update, for example if the library is libcommon:
poetry update libcommon
If a service is updated, we don't update its version in the pyproject.toml file. But we have to update the docker images file with the new image tag. Then the CI will test the new docker images, and we will be able to deploy them to the infrastructure.
All the contributions should go through a pull request. The pull requests must be "squashed" (i.e. one commit per pull request).
You can use act to test the GitHub Actions (see .github/workflows/) locally. It shortens the feedback loop when working on the GitHub Actions, avoids polluting the branches with empty pushes only meant to trigger the CI, and allows running only specific actions.
For example, to launch the build and push of the docker images to Docker Hub:
act -j build-and-push-image-to-docker-hub --secret-file my.secrets
with my.secrets a file with the secrets:
DOCKERHUB_USERNAME=xxx
DOCKERHUB_PASSWORD=xxx
GITHUB_TOKEN=xxx
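Before launching act, it can help to check that the secrets file defines every expected key. The helper below is a small sketch, not part of the repository: `parse_secrets` and `missing_secrets` are hypothetical names, and the parser assumes simple KEY=VALUE lines like the ones shown above.

```python
from typing import Dict, List, Sequence

# The three keys listed in the example secrets file.
REQUIRED_SECRETS = ("DOCKERHUB_USERNAME", "DOCKERHUB_PASSWORD", "GITHUB_TOKEN")


def parse_secrets(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    secrets: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        secrets[key.strip()] = value.strip()
    return secrets


def missing_secrets(text: str, required: Sequence[str] = REQUIRED_SECRETS) -> List[str]:
    """Return the required keys that are absent or empty, in the order given."""
    secrets = parse_secrets(text)
    return [key for key in required if not secrets.get(key)]
```

For example, calling `missing_secrets(open("my.secrets").read())` returns an empty list when the file is complete.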
To install the datasets-based worker on macOS, follow these steps.
Install brew:
$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Install ICU:
$ brew install icu4c
==> Caveats
icu4c is keg-only, which means it was not symlinked into /opt/homebrew,
because macOS provides libicucore.dylib (but nothing else).
If you need to have icu4c first in your PATH, run:
echo 'export PATH="/opt/homebrew/opt/icu4c/bin:$PATH"' >> ~/.zshrc
echo 'export PATH="/opt/homebrew/opt/icu4c/sbin:$PATH"' >> ~/.zshrc
For compilers to find icu4c you may need to set:
export LDFLAGS="-L/opt/homebrew/opt/icu4c/lib"
export CPPFLAGS="-I/opt/homebrew/opt/icu4c/include"
Add ICU to the path:
$ echo 'export PATH="/opt/homebrew/opt/icu4c/bin:$PATH"' >> ~/.zshrc
$ echo 'export PATH="/opt/homebrew/opt/icu4c/sbin:$PATH"' >> ~/.zshrc
Install pyenv:
$ curl https://pyenv.run | bash
Append the following lines to ~/.zshrc:
export PYENV_ROOT="$HOME/.pyenv"
command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"
Log out and log in again.
Install Python 3.9.15:
$ pyenv install 3.9.15
Check that the expected local version of Python is used:
$ cd workers/datasets_based
$ python --version
Python 3.9.15
Install poetry:
curl -sSL https://install.python-poetry.org | python3 -
Append the following line to ~/.zshrc:
export PATH="$HOME/.local/bin:$PATH"
Install rust:
$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
$ source $HOME/.cargo/env
Set the python version to use with poetry:
poetry env use 3.9.15
Avoid an issue with Apache beam (python-poetry/poetry#4888 (comment)):
poetry config experimental.new-installer false
Install the dependencies:
make install