-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Skip dataset verifications by default #5303
Conversation
The documentation is not available anymore as the PR was closed or merged. |
100% agree that the checksum verification is overkill and not super useful. But I think this PR would also disable the check on num_examples no ? As a user I would like to know if the dataset I'm loading changed significantly. What do you think ? We could have a default |
Accepting multiple types (booleans and strings) at the same time is not the best design. Maybe we could define an enum for this parameter? |
Yes an enum sounds good ! |
so we can have three verification levels, - smth like "ignore_all" (to skip both checksums and all other info like num_examples verification), "ignore_checksums" (to skip only checksums verification), and "verify_all" (to perform all verification)? @mariosasko if you're not going to work on this PR in the coming days, I can take over it if you want (this PR will help me with this issue, not super urgent though). |
Okay, I propose deprecating class VerificationMode(enum.Enum):
FULL = "full" # runs all verification checks
BASIC = "basic" # default, runs only the cheap ones (skips the checksum check)
NONE = "none" # skips all the checks WDTY? |
(copy paste from my message on slack) What do you think of a config variable in config.py to switch from one verification mode to another ? This way we don’t deprecate anything Many users are familiar with ignore_verifications=True, it might be overkill to deprecate it |
@lhoestq So we have "basic" verification mode in I like having a The usage point seems also valid to me, but cases when users are stuck with NonMatchingX errors also happen from time to time and to figure out what's wrong is non-trivial here. As a note aside - I suggest to add instructions to the NonMatchingX error message (how to use |
Ok I see. I'm fine with the new parameter then (even though I had a small pref for the config variable) :) |
I like the idea of an enum and the In relation with the config parameter, we could additionally add a As a note aside, I like the suggestion by @polinaeterna: we could give actionable messages when verifying checksums. This could be done in other PR. |
Show benchmarksPyArrow==6.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
Show benchmarksPyArrow==6.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
This is ready for review. If PS: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks all good ! thanks 🙌
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you! I've left a couple of text suggestions and also a couple of questions.
I would also prefer to change the name for the NONE
verification mode, but don't have really good ideas in mind. maybe smth like SKIP_ALL
?
src/datasets/builder.py
Outdated
@@ -724,7 +740,7 @@ def download_and_prepare( | |||
self._output_dir = fs_token_paths[2][0] if is_local else self._fs.unstrip_protocol(fs_token_paths[2][0]) | |||
|
|||
download_mode = DownloadMode(download_mode or DownloadMode.REUSE_DATASET_IF_EXISTS) | |||
verify_infos = not ignore_verifications | |||
verification_mode = VerificationMode(verification_mode or VerificationMode.BASIC) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe a stupid question idk but why not VerificationMode(verification_mode) or VerificationMode.BASIC
(here and everywhere below)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both approaches work, so I guess this is a matter of style
src/datasets/utils/info_utils.py
Outdated
|
||
| | Verification checks | | ||
|--------------------|------------------------------------------------------------------------------ | | ||
| `FULL` | Split checks, uniqueness of the keys yielded in case of the GeneratorBuilder | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is uniqueness of the keys check depended on the verification_mode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it does
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please point me to the place in the code? I can't find it.... 🙈
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the commit in which this check was introduced: beca084
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Co-authored-by: Polina Kazakova <polina@huggingface.co>
I decided to go with the following names:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you !
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Show benchmarksPyArrow==6.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
Show benchmarksPyArrow==6.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
Show benchmarksPyArrow==6.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
* Skip dataset verifications by default * Replace ignore_verifications with VerificationMode * Update io builders * Update commands * Update tests * Update docs * Misc * Fix bad copying in generator io builder * Update src/datasets/builder.py Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Polina Kazakova <polina@huggingface.co> * Rename values * Apply suggestions from code review Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> * Style --------- Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> Co-authored-by: Polina Kazakova <polina@huggingface.co>
* Skip dataset verifications by default * Replace ignore_verifications with VerificationMode * Update io builders * Update commands * Update tests * Update docs * Misc * Fix bad copying in generator io builder * Update src/datasets/builder.py Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Polina Kazakova <polina@huggingface.co> * Rename values * Apply suggestions from code review Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> * Style --------- Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> Co-authored-by: Polina Kazakova <polina@huggingface.co>
* Skip dataset verifications by default * Replace ignore_verifications with VerificationMode * Update io builders * Update commands * Update tests * Update docs * Misc * Fix bad copying in generator io builder * Update src/datasets/builder.py Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Polina Kazakova <polina@huggingface.co> * Rename values * Apply suggestions from code review Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> * Style --------- Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> Co-authored-by: Polina Kazakova <polina@huggingface.co>
Skip the dataset verifications (split and checksum verifications, duplicate keys check) by default unless a dataset is being tested (
datasets-cli test/run_beam
). The main goal is to avoid running the checksum check in the default case due to how expensive it can be for large datasets.PS: Maybe we should deprecate
ignore_verifications
, which isTrue
now by default, and give it a different name?