Multithreaded downloads #6794

lhoestq · 2024-04-09T11:13:19Z

...for faster dataset downloads when there are many small files (e.g. imagefolder, audiofolder)

Benchmark

For example, on lhoestq/tmp-images-writer_batch_size (128 images):

	duration of the download step in `load_dataset()`
Before	58s
Now	3s

This should fix issues with the Dataset Viewer taking too much time to show up for imagefolder/audiofolder datasets.

Implementation details

The main change is in the DownloadManager:

- download_func = partial(self._download, download_config=download_config)
+ download_func = partial(self._download_batched, download_config=download_config)
downloaded_path_or_paths = map_nested(
    download_func,
    url_or_urls,
    map_tuple=True,
    num_proc=download_config.num_proc,
    desc="Downloading data files",
+   batched=True,
+   batch_size=-1,
)

and _download_batched is a multithreaded function.

I only enable multithreading if there are more than 16 files and the files are small. Otherwise the progress bar, which counts the number of downloaded files, does not update smoothly: it only advances when a whole batch of big files has finished downloading. To decide, I simply check whether the first file is smaller than 20MB.
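The heuristic described above can be sketched roughly as follows. This is not the code from this PR: `download_batch`, `fetch_one` and `get_size` are hypothetical names, the thresholds simply mirror the 16-file and 20MB values mentioned above, and the standard-library `ThreadPoolExecutor` stands in for whatever threading helper the PR actually uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

MIN_FILES_FOR_THREADS = 16            # threshold taken from the description above
SMALL_FILE_BYTES = 20 * 1024 * 1024   # "20MB"; the exact unit is an assumption


def download_batch(
    urls: List[str],
    fetch_one: Callable[[str], str],   # hypothetical: downloads one URL, returns a local path
    get_size: Callable[[str], int],    # hypothetical: returns the remote file size in bytes
    max_workers: int = 16,
) -> List[str]:
    """Download a batch of URLs, using threads only for many small files."""
    use_threads = (
        len(urls) > MIN_FILES_FOR_THREADS
        and get_size(urls[0]) < SMALL_FILE_BYTES  # only the first file is inspected
    )
    if not use_threads:
        # Few files or big files: download one by one so progress stays smooth.
        return [fetch_one(url) for url in urls]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order, so paths line up with urls.
        return list(pool.map(fetch_one, urls))
```

Checking only the first file keeps the decision cheap (one size lookup per batch), at the cost of misjudging batches whose files vary a lot in size.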

I also had to tweak map_nested to support batching. In particular it slices the data correctly if the user also enables multiprocessing.
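To illustrate what "slicing correctly" can mean here, below is a small sketch, assuming a flat list of inputs. It is not the `map_nested` implementation: `split_for_workers` and `map_batched` are hypothetical helpers, and treating a non-positive `batch_size` as "one batch per worker slice" is an assumption of this sketch. The slices are processed sequentially to keep the example self-contained; with multiprocessing each slice would be handed to its own process.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def split_for_workers(items: Sequence[T], num_proc: int) -> List[Sequence[T]]:
    """Split items into at most num_proc contiguous, near-equal slices."""
    num_proc = max(1, min(num_proc, len(items)))
    base, extra = divmod(len(items), num_proc)
    slices, start = [], 0
    for i in range(num_proc):
        # The first `extra` slices get one additional item each.
        end = start + base + (1 if i < extra else 0)
        slices.append(items[start:end])
        start = end
    return slices


def map_batched(
    func: Callable[[Sequence[T]], List[U]],
    items: Sequence[T],
    batch_size: int = -1,
    num_proc: int = 1,
) -> List[U]:
    """Apply a batched func over items, slice by slice, preserving order."""
    results: List[U] = []
    for worker_slice in split_for_workers(items, num_proc):
        # Non-positive batch_size: pass the whole worker slice as a single batch.
        size = len(worker_slice) if batch_size <= 0 else batch_size
        for start in range(0, len(worker_slice), max(size, 1)):
            results.extend(func(worker_slice[start:start + size]))
    return results
```

Because the slices are contiguous and concatenated in order, the output list lines up with the input list regardless of how many workers are used.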

HuggingFaceDocBuilderDev · 2024-04-09T11:17:34Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

lhoestq · 2024-04-09T18:52:32Z

CI is failing because the parquet export of one test dataset is missing. The PR to fix this is at huggingface/dataset-viewer#2689

mariosasko · 2024-04-12T14:21:59Z

.github/workflows/ci.yml

@@ -54,7 +54,7 @@ jobs:
        if: ${{ matrix.os == 'ubuntu-latest' }}
        run: echo "installing pinned version of setuptools-scm to fix seqeval installation on 3.7" && pip install "setuptools-scm==6.4.2"
      - name: Install uv
-        run: pip install --upgrade uv
+        run: pip install uv==0.1.29


Would remove the pin to be consistent with huggingface_hub and diffusers:

Suggested change

run: pip install uv==0.1.29

(we don't use uv's advanced/experimental features, so a breaking change here is unlikely)

I had pinned it because 0.1.30 had bugs - I'll see if 0.1.31 has fixed them

It's been fixed in 0.1.31 (issue in uv: astral-sh/uv#2941) :)

mariosasko · 2024-04-12T14:22:09Z

.github/workflows/ci.yml

@@ -89,7 +89,7 @@ jobs:
      - name: Upgrade pip
        run: python -m pip install --upgrade pip
      - name: Install uv
-        run: pip install --upgrade uv
+        run: pip install uv==0.1.29


Same here:

Suggested change

run: pip install uv==0.1.29

mariosasko · 2024-04-12T14:22:42Z

src/datasets/data_files.py

    download_config: Optional[DownloadConfig] = None,
+    max_workers: int = 16,


Maybe this should be a config variable (which we would also use in DownloadManager)
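One way to read this suggestion is a single shared default that both call sites fall back to. The sketch below is illustrative only: the constant name echoes a commit title in this PR, but how the library actually defines or reads it is not shown in this thread, and `resolve_max_workers` is a hypothetical helper.

```python
from typing import Optional

# Hypothetical shared setting; the name echoes a commit title in this PR,
# but how the library defines it is not shown in this thread.
HF_DATASETS_MULTITHREADING_MAX_WORKERS = 16


def resolve_max_workers(max_workers: Optional[int] = None) -> int:
    """Return the explicit value if given, else the shared default."""
    if max_workers is None:
        # Read the module-level value at call time so overrides take effect.
        return HF_DATASETS_MULTITHREADING_MAX_WORKERS
    if max_workers < 1:
        raise ValueError("max_workers must be at least 1")
    return max_workers
```

With this shape, a function signature can default to `max_workers=None` instead of hard-coding `16` in several places.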

mariosasko · 2024-04-12T14:28:21Z

src/datasets/download/download_manager.py

+            )
+        else:
+            return [
+                self._download(url_or_filename, download_config=download_config)


Would rename this method to _download_single

lhoestq · 2024-04-15T16:13:07Z

I took your comments into account :) lmk what you think @mariosasko

mariosasko

LGTM!

github-actions · 2024-04-15T21:24:12Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.004956 / 0.011353 (-0.006397)	0.003282 / 0.011008 (-0.007726)	0.064028 / 0.038508 (0.025520)	0.030420 / 0.023109 (0.007311)	0.240097 / 0.275898 (-0.035801)	0.266356 / 0.323480 (-0.057124)	0.003116 / 0.007986 (-0.004869)	0.002597 / 0.004328 (-0.001731)	0.050230 / 0.004250 (0.045980)	0.043864 / 0.037052 (0.006812)	0.258711 / 0.258489 (0.000222)	0.290816 / 0.293841 (-0.003025)	0.027898 / 0.128546 (-0.100648)	0.009941 / 0.075646 (-0.065705)	0.208917 / 0.419271 (-0.210355)	0.035891 / 0.043533 (-0.007642)	0.253332 / 0.255139 (-0.001807)	0.274300 / 0.283200 (-0.008900)	0.019466 / 0.141683 (-0.122217)	1.133896 / 1.452155 (-0.318259)	1.178130 / 1.492716 (-0.314586)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.091093 / 0.018006 (0.073087)	0.293632 / 0.000490 (0.293142)	0.000216 / 0.000200 (0.000016)	0.000042 / 0.000054 (-0.000013)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.017722 / 0.037411 (-0.019689)	0.060241 / 0.014526 (0.045715)	0.072024 / 0.176557 (-0.104533)	0.118521 / 0.737135 (-0.618615)	0.071107 / 0.296338 (-0.225232)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.280950 / 0.215209 (0.065741)	2.781361 / 2.077655 (0.703706)	1.477949 / 1.504120 (-0.026171)	1.356388 / 1.541195 (-0.184807)	1.361808 / 1.468490 (-0.106682)	0.565499 / 4.584777 (-4.019278)	2.389206 / 3.745712 (-1.356506)	2.712782 / 5.269862 (-2.557079)	1.701402 / 4.565676 (-2.864274)	0.063619 / 0.424275 (-0.360656)	0.005321 / 0.007607 (-0.002286)	0.336783 / 0.226044 (0.110739)	3.299628 / 2.268929 (1.030699)	1.794686 / 55.444624 (-53.649939)	1.504207 / 6.876477 (-5.372270)	1.524637 / 2.142072 (-0.617436)	0.642833 / 4.805227 (-4.162395)	0.117808 / 6.500664 (-6.382856)	0.041539 / 0.075469 (-0.033930)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.960193 / 1.841788 (-0.881595)	11.229147 / 8.074308 (3.154839)	9.380653 / 10.191392 (-0.810739)	0.137184 / 0.680424 (-0.543240)	0.013399 / 0.534201 (-0.520802)	0.314904 / 0.579283 (-0.264379)	0.262539 / 0.434364 (-0.171825)	0.354007 / 0.540337 (-0.186331)	0.451698 / 1.386936 (-0.935238)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005207 / 0.011353 (-0.006146)	0.003660 / 0.011008 (-0.007348)	0.049931 / 0.038508 (0.011423)	0.030918 / 0.023109 (0.007809)	0.271243 / 0.275898 (-0.004655)	0.295706 / 0.323480 (-0.027774)	0.004106 / 0.007986 (-0.003879)	0.002750 / 0.004328 (-0.001578)	0.048337 / 0.004250 (0.044086)	0.039944 / 0.037052 (0.002892)	0.284013 / 0.258489 (0.025524)	0.306827 / 0.293841 (0.012987)	0.029183 / 0.128546 (-0.099363)	0.010033 / 0.075646 (-0.065613)	0.058126 / 0.419271 (-0.361146)	0.032427 / 0.043533 (-0.011106)	0.276471 / 0.255139 (0.021332)	0.288428 / 0.283200 (0.005229)	0.017549 / 0.141683 (-0.124134)	1.142361 / 1.452155 (-0.309793)	1.184514 / 1.492716 (-0.308202)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.090350 / 0.018006 (0.072344)	0.292511 / 0.000490 (0.292021)	0.000215 / 0.000200 (0.000015)	0.000041 / 0.000054 (-0.000013)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.021572 / 0.037411 (-0.015840)	0.074310 / 0.014526 (0.059784)	0.086102 / 0.176557 (-0.090455)	0.123507 / 0.737135 (-0.613629)	0.087397 / 0.296338 (-0.208941)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.294038 / 0.215209 (0.078829)	2.889662 / 2.077655 (0.812007)	1.591775 / 1.504120 (0.087655)	1.468815 / 1.541195 (-0.072379)	1.470226 / 1.468490 (0.001736)	0.574557 / 4.584777 (-4.010220)	2.481377 / 3.745712 (-1.264335)	2.763368 / 5.269862 (-2.506493)	1.713707 / 4.565676 (-2.851969)	0.064158 / 0.424275 (-0.360117)	0.005553 / 0.007607 (-0.002054)	0.353480 / 0.226044 (0.127436)	3.447689 / 2.268929 (1.178760)	1.975802 / 55.444624 (-53.468822)	1.673561 / 6.876477 (-5.202915)	1.637212 / 2.142072 (-0.504860)	0.640667 / 4.805227 (-4.164560)	0.114618 / 6.500664 (-6.386046)	0.038912 / 0.075469 (-0.036557)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.007581 / 1.841788 (-0.834207)	11.874250 / 8.074308 (3.799942)	10.312692 / 10.191392 (0.121300)	0.142705 / 0.680424 (-0.537719)	0.015438 / 0.534201 (-0.518763)	0.285919 / 0.579283 (-0.293364)	0.278223 / 0.434364 (-0.156141)	0.323806 / 0.540337 (-0.216531)	0.415007 / 1.386936 (-0.971929)

lhoestq added 3 commits April 9, 2024 12:43

multithreaded downloads

f72bffb

fix

b05e626

fix again

db97134

lhoestq added 5 commits April 9, 2024 17:17

fix tests

cd42a37

fix

e4a1d01

fix 16 workers

a2e921b

enable multithreading only for small files

bd5a779

pin uv

0ccb360

fix tests

6ccc067

lhoestq marked this pull request as ready for review April 12, 2024 09:40

lhoestq requested a review from mariosasko April 12, 2024 09:40

mariosasko reviewed Apr 12, 2024

lhoestq added 4 commits April 12, 2024 16:12

unpin uv

8e56b3e

add HF_DATASETS_MULTITHREADING_MAX_WORKERS

0ecc64f

rename _download_single

d8e31fb

minor

89c21fa

mariosasko approved these changes Apr 15, 2024

lhoestq merged commit 0f1f27c into main Apr 15, 2024
12 checks passed

lhoestq deleted the multithreaded-downloads branch April 15, 2024 21:18

This was referenced May 6, 2024

Download is broken for dict of dicts: FileNotFoundError #6869

Closed

Fix download for dict of dicts of URLs #6871

Merged
