-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use huggingface_hub
cache
#7105
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this @lhoestq! 🎉 🎉 🎉
I did a first pass and left a few minor comments. Looks good!
src/datasets/load.py
Outdated
@@ -276,7 +276,11 @@ def increase_load_count(name: str): | |||
"""Update the download count of a dataset.""" | |||
if not config.HF_HUB_OFFLINE and config.HF_UPDATE_DOWNLOAD_COUNTS: | |||
try: | |||
head_hf_s3(name, filename=name + ".py") | |||
requests.head( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to use huggingface_hub.utils.get_session().head(...)
to make HTTP requests instead of requests.head
.
It's a helper to return a unique session which keeps the connection open (quicker when consecutive calls) + check HF_HUB_OFFLINE
automatically + adds a request_id header to help debug things. Advanced users also have the possibility to customize the Session
settings, typically for proxies.
(I'm putting the comment here but it's the case for any requests.head
, requests.get
or requests.post
made by the datasets
library)
).resolve_path(url_or_filename) | ||
try: | ||
output_path = huggingface_hub.HfApi( | ||
endpoint=config.HF_ENDPOINT, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
endpoint=config.HF_ENDPOINT, |
This is already the default value in huggingface_hub
(parsed from the same environment variable)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
datasets
users can modify config. HF_ENDPOINT
so I'd rather keep it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean users monkey-patching a constant value at runtime without using the environment variable? I feel this is not something we should promote/support
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dataset-viewer
does it in its tests to switch between prod and testing endpoints :p
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok ok, maybe a topic for a separate PR then. It still feels wrong to me to handle endpoints in various places (both in huggingface_hub
and in datasets
)
library_name="datasets", | ||
library_version=__version__, | ||
user_agent=get_datasets_user_agent(download_config.user_agent), | ||
).hf_hub_download( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Top!
src/datasets/utils/file_utils.py
Outdated
@@ -1172,7 +913,7 @@ def _prepare_single_hop_path_and_storage_options( | |||
client_kwargs = storage_options.pop("client_kwargs", {}) | |||
storage_options["client_kwargs"] = {"trust_env": True, **client_kwargs} # Enable reading proxy env variables | |||
if "drive.google.com" in urlpath: | |||
response = http_head(urlpath) | |||
response = requests.head(urlpath, timeout=10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above about requests.head
vs get_session().head
. Maybe worth doing a pass on the datasets
codebase in a separate PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this and increase_load_count ()
and the viewer's calls are the only uses of requests
in datasets
left :) I'll use get_session().head
though it's not a big deal imo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great to know!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @lhoestq! I thought the switch would be a much more complex process but happy to realize it's not! 😄 I re-reviewed the PR and it looks good to me -to the extent of my knowledgeable-. Better to have other pairs of eyes for this one :)
fyi the CI failure on test_py310_numpy2 is unrelated to this PR (it's a dependency install failure) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CI error on test_py310_numpy2 has been temporarily fixed by:
src/datasets/commands/dummy_data.py
Outdated
@@ -1,468 +0,0 @@ | |||
import fnmatch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The removal of deprecated code has been addressed in a separate dedicated PR:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool ! I'll resolve the conflicts and merge :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
Show benchmarksPyArrow==8.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
yay! is this in a shipped release? |
we can do one in the coming days once @albertvillanova is back |
We have made a release and this feature is now included. |
hf_hub_download()
fromhuggingface_hub
for HF filesdatasets
cache_dir is still used for:Dataset
objects)