
Fix outdated verification_mode values #5607

Merged

Conversation

@polinaeterna (Contributor) commented Mar 3, 2023

I think it makes sense not to save the dataset_info.json file to the dataset cache directory when a dataset is loaded with verification_mode="no_checks". Otherwise, the next time the dataset is loaded without verification_mode="no_checks", it loads successfully even though some values in the info may not match the ones in the repo, which was the reason for using verification_mode="no_checks" in the first place.

I also updated the values of verification_mode to the current ones in a few places ("none" -> "no_checks", "all" -> "all_checks").
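The rename this PR fixes can be sketched as a small enum plus a mapping from the outdated string values to the current ones. This is a hypothetical, self-contained sketch for illustration; the real enum lives in the `datasets` library as `datasets.VerificationMode`, and the `modernize` helper is not part of the library.

```python
from enum import Enum


class VerificationMode(Enum):
    """Sketch of the verification modes after the rename."""

    NO_CHECKS = "no_checks"        # was "none" before the rename
    BASIC_CHECKS = "basic_checks"
    ALL_CHECKS = "all_checks"      # was "all" before the rename


# Hypothetical mapping used to spot the outdated values this PR fixes.
OUTDATED_TO_CURRENT = {"none": "no_checks", "all": "all_checks"}


def modernize(value: str) -> str:
    """Translate an outdated mode string to its current equivalent."""
    return OUTDATED_TO_CURRENT.get(value, value)


print(modernize("none"))  # -> no_checks
print(modernize("all"))   # -> all_checks
```

Strings that are already current pass through `modernize` unchanged, so it is safe to apply everywhere the old values might appear.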

@HuggingFaceDocBuilderDev commented Mar 3, 2023

The documentation is not available anymore as the PR was closed or merged.

@polinaeterna polinaeterna requested a review from mariosasko March 3, 2023 20:13
@polinaeterna polinaeterna marked this pull request as ready for review March 3, 2023 20:13
@polinaeterna polinaeterna requested a review from lhoestq March 6, 2023 18:35
@lhoestq (Member) left a comment:

Actually, we need the dataset info.

Comment on lines 360 to 361:

```python
if os.path.exists(self._cache_dir) > 0:  # check if data exist
    if len(os.listdir(self._cache_dir)):
```

Member:

Suggested change:

```python
if os.path.exists(self._cache_dir):  # check if data exist
    if len(os.listdir(self._cache_dir)) > 0:
```
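The suggestion above moves the `> 0` comparison to the line it belongs on: `os.path.exists` returns a bool, so comparing it to 0 is meaningless, while the entry count from `os.listdir` is what should be compared. A minimal runnable sketch of the corrected check (the `cache_has_data` helper is hypothetical, not the library's code):

```python
import os
import tempfile


def cache_has_data(cache_dir: str) -> bool:
    """Corrected check: the directory must exist AND contain at
    least one entry. Sketch of the review suggestion, not the
    library's actual implementation."""
    return os.path.exists(cache_dir) and len(os.listdir(cache_dir)) > 0


with tempfile.TemporaryDirectory() as d:
    print(cache_has_data(d))  # empty directory -> False
    open(os.path.join(d, "data.arrow"), "w").close()
    print(cache_has_data(d))  # contains a file -> True
print(cache_has_data(d))      # directory removed -> False
```

Note that the original buggy line still "worked" by accident, since `True > 0` evaluates to `True` in Python, which is why the bug survived until this review.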

Comment on lines 884 to 885:

```python
if verification_mode is not VerificationMode.NO_CHECKS:
    self._save_info()
```

Member:

We should always save the info. For example it stores the feature types and the arrow shards lengths.

@polinaeterna (Contributor, author):

Hm, true...
What should we do then? I don't think it's fine that we get different results running the same command, e.g.:

```python
load_dataset("ds_name")  # error: failed verifications
# then
load_dataset("ds_name", verification_mode="no_checks")  # loads successfully
# then
load_dataset("ds_name")  # loads successfully despite the earlier error
```

Am I missing something?

Member:

I don't think we can do anything, to be honest. For example, reloading a cached dataset works even if you no longer have the downloaded raw files, so we can't check their checksums.

@polinaeterna (Contributor, author) commented Mar 8, 2023:

Okay, reverted it.

But I also changed the incorrect values of verification_mode in two places, from "none" -> "no_checks" and "all" -> "all_checks" (they weren't updated in the previous verification PR #5303). Do you agree with this change?

@polinaeterna polinaeterna changed the title Don't save dataset info to cache dir when skipping verifications Fix outdated verification_mode values Mar 8, 2023
@lhoestq (Member) left a comment:

Yea the fix for the values is definitely welcome, thank you :)

```python
if len(os.listdir(self._cache_dir)) > 0 and os.path.exists(
    path_join(self._cache_dir, config.DATASET_INFO_FILENAME)
):
    logger.info("Overwrite dataset info from restored data version if exists.")
    self.info = DatasetInfo.from_directory(self._cache_dir)
else:  # dir exists but no data, remove the empty dir as data aren't available anymore
```
Member:

I think we should remove the directory only if it's completely empty, but with your change it would also be removed when the dataset info file is missing. This matters in case some data files are somehow still left.
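The point above, deleting the cache directory only when it is truly empty so stray data files are never lost, can be sketched as follows. The `remove_cache_dir_if_empty` helper is hypothetical, written only to illustrate the review comment, not the library's actual code:

```python
import os
import tempfile


def remove_cache_dir_if_empty(cache_dir: str) -> bool:
    """Remove the cache directory only when it is completely empty.
    Returns True if the directory was removed. A leftover data file
    (even without dataset_info.json) keeps the directory alive."""
    if os.path.isdir(cache_dir) and not os.listdir(cache_dir):
        os.rmdir(cache_dir)  # os.rmdir refuses non-empty dirs anyway
        return True
    return False


tmp = tempfile.mkdtemp()
open(os.path.join(tmp, "leftover.arrow"), "w").close()
print(remove_cache_dir_if_empty(tmp))  # non-empty -> False, kept
os.remove(os.path.join(tmp, "leftover.arrow"))
print(remove_cache_dir_if_empty(tmp))  # empty -> True, removed
print(os.path.exists(tmp))             # False
```

Using `os.rmdir` rather than `shutil.rmtree` adds a second safety net: it raises if the directory unexpectedly still contains files.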

@polinaeterna (Contributor, author):

Thanks for noticing, fixed it. Can I merge?

@lhoestq (Member) left a comment:

Thanks :)

@mariosasko (Collaborator) left a comment:

Good catch!

@polinaeterna polinaeterna merged commit 778d4e1 into huggingface:main Mar 9, 2023
@polinaeterna polinaeterna deleted the do-not-save-info-when-no-checks branch March 9, 2023 17:27
@github-actions bot commented Mar 9, 2023

Show benchmarks

[Benchmark tables: new / old timings with diffs, for PyArrow==8.0.0 and PyArrow==latest, across the suites benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json, and benchmark_map_filter.json. Full per-metric tables omitted.]
