Fix outdated verification_mode values #5607
Conversation
The documentation is not available anymore as the PR was closed or merged.
…erna/datasets into do-not-save-info-when-no-checks
…hange values again
actually we need the dataset info
src/datasets/builder.py (Outdated)

    if os.path.exists(self._cache_dir) > 0:  # check if data exist
        if len(os.listdir(self._cache_dir)):
Suggested change:

    - if os.path.exists(self._cache_dir) > 0:  # check if data exist
    -     if len(os.listdir(self._cache_dir)):
    + if os.path.exists(self._cache_dir):  # check if data exist
    +     if len(os.listdir(self._cache_dir)) > 0:
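As an aside on why the suggestion matters: the original condition only works by accident, because `os.path.exists` returns a `bool`, and in Python `bool` is a subclass of `int`, so `True > 0` evaluates to `True`. A minimal sketch (my illustration, not part of the review) showing that the suggested form is the clearer equivalent:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as cache_dir:
    exists = os.path.exists(cache_dir)
    # `exists > 0` happens to work because True == 1, but it compares a
    # boolean to an integer, which obscures the intent.
    print(exists > 0)   # True, only because bool subclasses int
    print(exists)       # True - the clearer, equivalent check

    # The `> 0` is meaningful on the listing length instead:
    print(len(os.listdir(cache_dir)) > 0)  # False - directory is empty
```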
src/datasets/builder.py (Outdated)

    if verification_mode is not VerificationMode.NO_CHECKS:
        self._save_info()
We should always save the info. For example, it stores the feature types and the Arrow shard lengths.
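For context, the cached `dataset_info.json` is what lets a dataset be reloaded without re-deriving its metadata. A rough sketch of the round trip (the field names below are illustrative assumptions, not the exact schema used by `datasets`):

```python
import json
import os
import tempfile

# Illustrative metadata in the spirit of dataset_info.json; the exact
# schema used by the `datasets` library may differ (this is an assumption).
info = {
    "features": {"text": {"dtype": "string"}, "label": {"dtype": "int64"}},
    "splits": {"train": {"num_examples": 1000, "num_bytes": 54321}},
}

with tempfile.TemporaryDirectory() as cache_dir:
    path = os.path.join(cache_dir, "dataset_info.json")
    with open(path, "w") as f:
        json.dump(info, f)

    # Reloading a cached dataset later depends on this file: without it,
    # feature types and split sizes would have to be reconstructed.
    with open(path) as f:
        restored = json.load(f)
    print(restored["splits"]["train"]["num_examples"])  # 1000
```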
hm true...
what should we do then? I don't think it's fine that we get different results running the same command, like:

    load_dataset("ds_name")  # error - failed verifications
    # then
    load_dataset("ds_name", verification_mode="no_checks")  # successfully loaded
    # then
    load_dataset("ds_name")  # successfully loaded, even though it errored before

am I missing something?
I don't think we can do anything, tbh - e.g. reloading a cached dataset works even if you no longer have the downloaded raw files, so we can't verify their checksums.
okay, reverted it.
But I also fixed incorrect values of verification_mode in two places, from "none" -> "no_checks" and "all" -> "all_checks" (they weren't updated in the previous verification PR #5303). Do you agree with this change?
Yea the fix for the values is definitely welcome, thank you :)
src/datasets/builder.py

    if len(os.listdir(self._cache_dir)) > 0 and os.path.exists(
        path_join(self._cache_dir, config.DATASET_INFO_FILENAME)
    ):
        logger.info("Overwrite dataset info from restored data version if exists.")
        self.info = DatasetInfo.from_directory(self._cache_dir)
    else:  # dir exists but no data, remove the empty dir as data aren't available anymore
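The distinction at play in this branch can be sketched self-contained with the standard library (an illustrative helper, not the actual library code - the filename and return labels are my assumptions): remove the cache directory only when it is completely empty, and keep it when leftover data files remain even if the info file is missing.

```python
import os
import shutil
import tempfile

def handle_cache_dir(cache_dir: str, info_filename: str = "dataset_info.json") -> str:
    """Illustrative helper: decide what to do with a cache directory."""
    entries = os.listdir(cache_dir)
    if len(entries) > 0 and os.path.exists(os.path.join(cache_dir, info_filename)):
        return "restore info"      # data plus info file: safe to reload
    if len(entries) == 0:
        shutil.rmtree(cache_dir)   # completely empty: remove the dir
        return "removed"
    return "keep"                  # leftover data files but no info file: keep them

# A dir holding a stray data file but no info file should NOT be removed.
cache = tempfile.mkdtemp()
open(os.path.join(cache, "data-00000.arrow"), "w").close()
print(handle_cache_dir(cache))  # keep
shutil.rmtree(cache)
```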
I think we should remove the directory only if it's completely empty - but with your change it would be removed whenever the dataset info file is missing. This matters in case some data files are still left somehow.
thanks for noticing, fixed it, can I merge?
Thanks :)
Good catch!
I think it makes sense not to save the `dataset_info.json` file to a dataset cache directory when loading a dataset with `verification_mode="no_checks"`, because otherwise, the next time the dataset is loaded without `verification_mode="no_checks"`, it will load successfully even though some values in the info might not correspond to the ones in the repo - which was the reason for using `verification_mode="no_checks"` in the first place.

Also updated values of `verification_mode` to the current ones in some places ("none" -> "no_checks", "all" -> "all_checks").