-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multithreaded downloads #6794
Multithreaded downloads #6794
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
CI is failing because of the missing parquet export of one test dataset, PR to fix this at huggingface/dataset-viewer#2689 |
.github/workflows/ci.yml
Outdated
@@ -54,7 +54,7 @@ jobs: | |||
if: ${{ matrix.os == 'ubuntu-latest' }} | |||
run: echo "installing pinned version of setuptools-scm to fix seqeval installation on 3.7" && pip install "setuptools-scm==6.4.2" | |||
- name: Install uv | |||
run: pip install --upgrade uv | |||
run: pip install uv==0.1.29 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would remove the pin to be consistent with huggingface_hub
and diffusers
:
run: pip install uv==0.1.29 |
(we don't use uv
's advanced/experimental features, so a breaking change here is unlikely)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had pinned it because 0.1.30 had bugs - I'll see if 0.1.31 has fixed them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's been fixed in 0.1.31 (issue in uv
: astral-sh/uv#2941) :)
.github/workflows/ci.yml
Outdated
@@ -89,7 +89,7 @@ jobs: | |||
- name: Upgrade pip | |||
run: python -m pip install --upgrade pip | |||
- name: Install uv | |||
run: pip install --upgrade uv | |||
run: pip install uv==0.1.29 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here:
run: pip install uv==0.1.29 |
src/datasets/data_files.py
Outdated
download_config: Optional[DownloadConfig] = None, | ||
max_workers: int = 16, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this should be a config variable (which we would also use in DownloadManager
)
) | ||
else: | ||
return [ | ||
self._download(url_or_filename, download_config=download_config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would rename this method to _download_single
I took your comments into account :) lmk what you think @mariosasko |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Show benchmarksPyArrow==8.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
...for faster dataset download when there are many many small files (e.g. imagefolder, audiofolder)
Behcnmark
for example on lhoestq/tmp-images-writer_batch_size (128 images)
load_dataset()
This should fix issues with the Dataset Viewer taking too much time to show up for imagefolder/audiofolder datasets.
Implementation details
The main change is in the
DownloadManager
:and
_download_batched
is a multithreaded function.I only enable multithreading if there are more than 16 files and files are small though, otherwise the progress bar that counts the number of downloaded files is not fluid (updating when a big batch of big files are done downloading). To do so I simply check if the first file is smaller than 20MB.
I also had to tweak
map_nested
to support batching. In particular it slices the data correctly if the user also enables multiprocessing.