Use `huggingface_hub` cache #7105

lhoestq · 2024-08-15T14:45:22Z

Use hf_hub_download() from huggingface_hub for HF files.

The datasets cache_dir is still used for:
- caching datasets as Arrow files (which back Dataset objects)
- extracted archives and uncompressed files
- files downloaded via http (datasets with scripts)

I removed the code that was made for http files (and also the dummy_data / mock_download_manager code, which happened to rely on it and has been legacy for a while now).
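
As a rough illustration of the first point, the sketch below downloads one file from a dataset repository with hf_hub_download() and looks it up in the cache with try_to_load_from_cache(). The helper names fetch_dataset_file and cached_dataset_file are hypothetical and not part of this PR; this only shows the huggingface_hub calls involved, not the code path datasets actually uses.

```python
from typing import Optional

from huggingface_hub import hf_hub_download, try_to_load_from_cache


def fetch_dataset_file(repo_id: str, filename: str, cache_dir: Optional[str] = None) -> str:
    """Download one file from a dataset repo and return its local path.

    Hypothetical helper: the file lands in the huggingface_hub cache,
    or in `cache_dir` if one is given.
    """
    return hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        repo_type="dataset",  # dataset repos need an explicit repo_type
        cache_dir=cache_dir,
    )


def cached_dataset_file(repo_id: str, filename: str, cache_dir: Optional[str] = None) -> Optional[str]:
    """Return the cached path of a file, or None if it is not in the cache."""
    cached = try_to_load_from_cache(repo_id, filename, cache_dir=cache_dir, repo_type="dataset")
    # try_to_load_from_cache can also return a sentinel object when the Hub
    # reported that the file does not exist, so only accept real paths.
    return cached if isinstance(cached, str) else None
```

cached_dataset_file() never touches the network, so it can be used to check whether a file is already available before calling fetch_dataset_file().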

HuggingFaceDocBuilderDev · 2024-08-15T14:47:38Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Wauplin

Thanks for working on this @lhoestq! 🎉 🎉 🎉

I did a first pass and left a few minor comments. Looks good!

Wauplin · 2024-08-16T14:59:51Z

src/datasets/load.py

@@ -276,7 +276,11 @@ def increase_load_count(name: str):
    """Update the download count of a dataset."""
    if not config.HF_HUB_OFFLINE and config.HF_UPDATE_DOWNLOAD_COUNTS:
        try:
-            head_hf_s3(name, filename=name + ".py")
+            requests.head(


It's better to use huggingface_hub.utils.get_session().head(...) instead of requests.head to make HTTP requests.

get_session() is a helper that returns a single shared session, which keeps the connection open (quicker for consecutive calls), checks HF_HUB_OFFLINE automatically, and adds a request_id header to help with debugging. Advanced users can also customize the Session settings, typically for proxies.

(I'm putting the comment here, but it applies to any requests.head, requests.get or requests.post call made by the datasets library.)
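
A minimal sketch of that suggestion, assuming a huggingface_hub version that exposes get_session() in huggingface_hub.utils. The helper name head_status is hypothetical and only illustrates the call shape; it is not the code that ended up in the PR.

```python
from huggingface_hub.utils import get_session


def head_status(url: str, timeout: float = 10.0) -> int:
    """Send a HEAD request through the shared huggingface_hub session.

    Hypothetical helper: returns only the HTTP status code.
    """
    response = get_session().head(url, timeout=timeout)
    return response.status_code


# Within one thread, repeated calls return the same session object,
# which is what lets consecutive requests reuse the open connection.
session_is_shared = get_session() is get_session()
```

Compared with a bare requests.head(url), the only change at the call site is going through get_session() first.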

Wauplin · 2024-08-16T15:06:06Z

src/datasets/utils/file_utils.py

+            ).resolve_path(url_or_filename)
+            try:
+                output_path = huggingface_hub.HfApi(
+                    endpoint=config.HF_ENDPOINT,


Suggested change

endpoint=config.HF_ENDPOINT,

This is already the default value in huggingface_hub (parsed from the same environment variable)

datasets users can modify config.HF_ENDPOINT, so I'd rather keep it.

Do you mean users monkey-patching a constant value at runtime without using the environment variable? I feel this is not something we should promote/support

dataset-viewer does it in its tests to switch between prod and testing endpoints :p

ok ok, maybe a topic for a separate PR then. It still feels wrong to me to handle endpoints in various places (both in huggingface_hub and in datasets)
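
The disagreement above can be sketched with plain Python, under stated assumptions: `config`, `DEFAULT_ENDPOINT` and `resolve_endpoint` below are hypothetical stand-ins, not the real datasets or huggingface_hub objects. The sketch shows why a default parsed from the environment at import time does not see a constant patched at runtime, while an explicitly passed value does.

```python
import os
from types import SimpleNamespace

# Hypothetical stand-in for a config module: the constant is read from the
# environment once, when the module is imported.
config = SimpleNamespace(HF_ENDPOINT=os.environ.get("HF_ENDPOINT", "https://huggingface.co"))

# Hypothetical stand-in for a second library that parsed the same
# environment variable at import time and kept its own copy.
DEFAULT_ENDPOINT = config.HF_ENDPOINT


def resolve_endpoint(endpoint=None):
    """Return the explicit endpoint if given, else the import-time default."""
    return endpoint if endpoint is not None else DEFAULT_ENDPOINT


# A test suite switches endpoints by patching the constant at runtime.
config.HF_ENDPOINT = "https://hub.example.test"

default_result = resolve_endpoint()  # still the import-time value
explicit_result = resolve_endpoint(config.HF_ENDPOINT)  # sees the patched value
```

Passing the endpoint explicitly keeps runtime patching working, at the cost of handling the endpoint in two places, which is the concern raised in the thread.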

Wauplin · 2024-08-16T15:06:54Z

src/datasets/utils/file_utils.py

+                    library_name="datasets",
+                    library_version=__version__,
+                    user_agent=get_datasets_user_agent(download_config.user_agent),
+                ).hf_hub_download(


Wauplin · 2024-08-16T15:09:53Z

src/datasets/utils/file_utils.py

@@ -1172,7 +913,7 @@ def _prepare_single_hop_path_and_storage_options(
        client_kwargs = storage_options.pop("client_kwargs", {})
        storage_options["client_kwargs"] = {"trust_env": True, **client_kwargs}  # Enable reading proxy env variables
        if "drive.google.com" in urlpath:
-            response = http_head(urlpath)
+            response = requests.head(urlpath, timeout=10)


Same comment as above about requests.head vs get_session().head. Maybe worth doing a pass on the datasets codebase in a separate PR

This, increase_load_count() and the viewer's calls are the only uses of requests left in datasets :) I'll use get_session().head, though it's not a big deal imo.

great to know!

severo · 2024-08-19T13:25:08Z

Nice

Wauplin

Thanks @lhoestq! I thought the switch would be a much more complex process, but I'm happy to realize it's not! 😄 I re-reviewed the PR and it looks good to me, to the extent of my knowledge. It would be better to have another pair of eyes on this one :)

lhoestq · 2024-08-19T16:48:15Z

fyi the CI failure on test_py310_numpy2 is unrelated to this PR (it's a dependency install failure)

albertvillanova

The CI error on test_py310_numpy2 has been temporarily fixed by:

Temporarily pin numpy<2.1 to fix CI #7114

albertvillanova · 2024-08-21T08:44:57Z

src/datasets/commands/dummy_data.py

@@ -1,468 +0,0 @@
-import fnmatch


The removal of deprecated code has been addressed in a separate dedicated PR:

Remove deprecated code #6996: 5d4687e

Cool! I'll resolve the conflicts and merge :)

albertvillanova

Thanks.

github-actions · 2024-08-21T15:53:10Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005677 / 0.011353 (-0.005676)	0.004054 / 0.011008 (-0.006954)	0.063101 / 0.038508 (0.024592)	0.031665 / 0.023109 (0.008556)	0.243332 / 0.275898 (-0.032566)	0.271067 / 0.323480 (-0.052413)	0.004283 / 0.007986 (-0.003703)	0.002889 / 0.004328 (-0.001440)	0.049269 / 0.004250 (0.045018)	0.048707 / 0.037052 (0.011654)	0.258599 / 0.258489 (0.000110)	0.307715 / 0.293841 (0.013874)	0.029850 / 0.128546 (-0.098696)	0.012299 / 0.075646 (-0.063347)	0.207616 / 0.419271 (-0.211656)	0.037655 / 0.043533 (-0.005878)	0.246602 / 0.255139 (-0.008537)	0.268518 / 0.283200 (-0.014682)	0.018128 / 0.141683 (-0.123555)	1.181569 / 1.452155 (-0.270586)	1.250641 / 1.492716 (-0.242075)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.143911 / 0.018006 (0.125905)	0.305608 / 0.000490 (0.305118)	0.000250 / 0.000200 (0.000050)	0.000043 / 0.000054 (-0.000011)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.019208 / 0.037411 (-0.018204)	0.062502 / 0.014526 (0.047976)	0.075896 / 0.176557 (-0.100661)	0.123422 / 0.737135 (-0.613713)	0.077311 / 0.296338 (-0.219028)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.283108 / 0.215209 (0.067899)	2.783509 / 2.077655 (0.705855)	1.466358 / 1.504120 (-0.037762)	1.350989 / 1.541195 (-0.190206)	1.370517 / 1.468490 (-0.097973)	0.732706 / 4.584777 (-3.852071)	2.366710 / 3.745712 (-1.379002)	2.988913 / 5.269862 (-2.280949)	1.892204 / 4.565676 (-2.673473)	0.079077 / 0.424275 (-0.345198)	0.005158 / 0.007607 (-0.002449)	0.336620 / 0.226044 (0.110576)	3.423556 / 2.268929 (1.154628)	1.848732 / 55.444624 (-53.595892)	1.544996 / 6.876477 (-5.331480)	1.550051 / 2.142072 (-0.592022)	0.798235 / 4.805227 (-4.006993)	0.132945 / 6.500664 (-6.367719)	0.041785 / 0.075469 (-0.033684)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.963359 / 1.841788 (-0.878429)	11.699994 / 8.074308 (3.625686)	9.311998 / 10.191392 (-0.879394)	0.140493 / 0.680424 (-0.539931)	0.013834 / 0.534201 (-0.520367)	0.302569 / 0.579283 (-0.276714)	0.267377 / 0.434364 (-0.166987)	0.341093 / 0.540337 (-0.199244)	0.431941 / 1.386936 (-0.954995)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005744 / 0.011353 (-0.005608)	0.003668 / 0.011008 (-0.007340)	0.049837 / 0.038508 (0.011329)	0.032051 / 0.023109 (0.008941)	0.271725 / 0.275898 (-0.004173)	0.302612 / 0.323480 (-0.020867)	0.004455 / 0.007986 (-0.003531)	0.002816 / 0.004328 (-0.001512)	0.049036 / 0.004250 (0.044785)	0.041233 / 0.037052 (0.004181)	0.287900 / 0.258489 (0.029411)	0.326204 / 0.293841 (0.032363)	0.032027 / 0.128546 (-0.096519)	0.012033 / 0.075646 (-0.063613)	0.060822 / 0.419271 (-0.358449)	0.033830 / 0.043533 (-0.009703)	0.274855 / 0.255139 (0.019716)	0.294191 / 0.283200 (0.010992)	0.017979 / 0.141683 (-0.123704)	1.151353 / 1.452155 (-0.300801)	1.215384 / 1.492716 (-0.277333)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.102552 / 0.018006 (0.084546)	0.314148 / 0.000490 (0.313658)	0.000217 / 0.000200 (0.000017)	0.000043 / 0.000054 (-0.000011)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.024565 / 0.037411 (-0.012846)	0.076968 / 0.014526 (0.062442)	0.087982 / 0.176557 (-0.088574)	0.129844 / 0.737135 (-0.607292)	0.091370 / 0.296338 (-0.204968)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.296767 / 0.215209 (0.081558)	2.910716 / 2.077655 (0.833062)	1.579526 / 1.504120 (0.075406)	1.453457 / 1.541195 (-0.087737)	1.466296 / 1.468490 (-0.002194)	0.728372 / 4.584777 (-3.856405)	0.963852 / 3.745712 (-2.781861)	2.946582 / 5.269862 (-2.323280)	1.936199 / 4.565676 (-2.629478)	0.078886 / 0.424275 (-0.345389)	0.005537 / 0.007607 (-0.002071)	0.346315 / 0.226044 (0.120270)	3.440774 / 2.268929 (1.171845)	1.937549 / 55.444624 (-53.507076)	1.649507 / 6.876477 (-5.226970)	1.653386 / 2.142072 (-0.488686)	0.806598 / 4.805227 (-3.998629)	0.133384 / 6.500664 (-6.367280)	0.040552 / 0.075469 (-0.034917)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.030515 / 1.841788 (-0.811272)	12.129888 / 8.074308 (4.055580)	10.287069 / 10.191392 (0.095677)	0.141512 / 0.680424 (-0.538912)	0.015483 / 0.534201 (-0.518718)	0.300053 / 0.579283 (-0.279230)	0.120825 / 0.434364 (-0.313539)	0.342681 / 0.540337 (-0.197656)	0.470616 / 1.386936 (-0.916320)

julien-c · 2024-09-02T13:47:33Z

yay! is this in a shipped release?

lhoestq · 2024-09-02T14:10:37Z

we can do one in the coming days once @albertvillanova is back

albertvillanova · 2024-09-12T04:36:07Z

We have made a release and this feature is now included.

lhoestq added 3 commits August 15, 2024 16:43

use hfh cache

04159a4

remove unused mock download manager

216b5cf

fix remaining http calls

6a78c8f

lhoestq added 2 commits August 15, 2024 16:51

remove test line

47d0d58

use the hfh lib cache_dir for hf_hub_download

3bcdc29

lhoestq added 3 commits August 16, 2024 16:00

bump hfh minimum version

cb59365

update tests

93cfc87

style

35b67a2

Wauplin reviewed Aug 16, 2024

lhoestq added 3 commits August 16, 2024 17:19

again

e124f66

lucain's comments

99391e9

fix tests

26cf9d2

lhoestq marked this pull request as ready for review August 16, 2024 15:55

lhoestq requested a review from albertvillanova August 16, 2024 15:56

minor

2e70850

lhoestq force-pushed the use-hfh-cache branch from d695155 to 2e70850 Compare August 19, 2024 11:02

update offline test

a09974f

Wauplin approved these changes Aug 19, 2024

lhoestq added 3 commits August 19, 2024 16:57

don't test time out on old hfh

9641549

minor

5e0f49d

disable some tests on windows

555e48a

severo mentioned this pull request Aug 20, 2024

Imagefolder: UnexpectedError with root cause: "[Errno 13] Permission denied: '/tmp/hf-datasets-cache/medium/datasets/....incomplete'" huggingface/dataset-viewer#3027

Closed

albertvillanova reviewed Aug 21, 2024

albertvillanova approved these changes Aug 21, 2024

Wauplin mentioned this pull request Aug 21, 2024

huggingface-cli scan-cache doesn't capture cached datasets huggingface/huggingface_hub#2218

Open

lhoestq added 3 commits August 21, 2024 17:19

Merge branch 'main' into use-hfh-cache

8f969cd

update docs

8c73011

typo

85e9302

lhoestq merged commit 2878019 into main Aug 21, 2024
15 checks passed

lhoestq deleted the use-hfh-cache branch August 21, 2024 15:47
