
Fix outdated verification_mode values #5607

Merged

Conversation

@polinaeterna (Contributor) commented Mar 3, 2023

I think it makes sense not to save the dataset_info.json file to the dataset cache directory when a dataset is loaded with verification_mode="no_checks". Otherwise, the next time the dataset is loaded without verification_mode="no_checks", it loads successfully even though some values in the info may not match the ones in the repo, which was the reason for using verification_mode="no_checks" in the first place.

I also updated the values of verification_mode to the current ones in a few places ("none" -> "no_checks", "all" -> "all_checks").
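The rename this PR fixes can be sketched as a small enum plus a mapping from the outdated string values to the current ones. This is a hypothetical, self-contained sketch for illustration; the real enum lives in the `datasets` library as `datasets.VerificationMode`, and the `modernize` helper is not part of the library.

```python
from enum import Enum


class VerificationMode(Enum):
    """Sketch of the verification modes after the rename."""

    NO_CHECKS = "no_checks"        # was "none" before the rename
    BASIC_CHECKS = "basic_checks"
    ALL_CHECKS = "all_checks"      # was "all" before the rename


# Hypothetical mapping used to spot the outdated values this PR fixes.
OUTDATED_TO_CURRENT = {"none": "no_checks", "all": "all_checks"}


def modernize(value: str) -> str:
    """Translate an outdated mode string to its current equivalent."""
    return OUTDATED_TO_CURRENT.get(value, value)


print(modernize("none"))  # -> no_checks
print(modernize("all"))   # -> all_checks
```

Strings that are already current pass through `modernize` unchanged, so it is safe to apply everywhere the old values might appear.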

@HuggingFaceDocBuilderDev commented Mar 3, 2023

The documentation is not available anymore as the PR was closed or merged.

@polinaeterna polinaeterna requested a review from mariosasko March 3, 2023 20:13
@polinaeterna polinaeterna marked this pull request as ready for review March 3, 2023 20:13
@polinaeterna polinaeterna requested a review from lhoestq March 6, 2023 18:35
@lhoestq (Member) left a comment:

Actually, we need the dataset info.

Comment on lines 360 to 361:

```python
if os.path.exists(self._cache_dir) > 0:  # check if data exist
    if len(os.listdir(self._cache_dir)):
```

Member:

Suggested change:

```python
if os.path.exists(self._cache_dir):  # check if data exist
    if len(os.listdir(self._cache_dir)) > 0:
```
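The suggestion above moves the `> 0` comparison to the line it belongs on: `os.path.exists` returns a bool, so comparing it to 0 is meaningless, while the entry count from `os.listdir` is what should be compared. A minimal runnable sketch of the corrected check (the `cache_has_data` helper is hypothetical, not the library's code):

```python
import os
import tempfile


def cache_has_data(cache_dir: str) -> bool:
    """Corrected check: the directory must exist AND contain at
    least one entry. Sketch of the review suggestion, not the
    library's actual implementation."""
    return os.path.exists(cache_dir) and len(os.listdir(cache_dir)) > 0


with tempfile.TemporaryDirectory() as d:
    print(cache_has_data(d))  # empty directory -> False
    open(os.path.join(d, "data.arrow"), "w").close()
    print(cache_has_data(d))  # contains a file -> True
print(cache_has_data(d))      # directory removed -> False
```

Note that the original buggy line still "worked" by accident, since `True > 0` evaluates to `True` in Python, which is why the bug survived until this review.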

Comment on lines 884 to 885:

```python
if verification_mode is not VerificationMode.NO_CHECKS:
    self._save_info()
```

Member:

We should always save the info. For example it stores the feature types and the arrow shards lengths.

@polinaeterna (Contributor, author):

Hm, true...
What should we do then? I don't think it's fine that we get different results running the same command, e.g.:

```python
load_dataset("ds_name")  # error: failed verifications
# then
load_dataset("ds_name", verification_mode="no_checks")  # loads successfully
# then
load_dataset("ds_name")  # loads successfully despite the earlier error
```

Am I missing something?

Member:

I don't think we can do anything, to be honest. For example, reloading a cached dataset works even if you no longer have the downloaded raw files, so we can't check their checksums.

@polinaeterna (Contributor, author) commented Mar 8, 2023:

Okay, reverted it.

But I also changed the incorrect values of verification_mode in two places, from "none" -> "no_checks" and "all" -> "all_checks" (they weren't updated in the previous verification PR #5303). Do you agree with this change?

@polinaeterna polinaeterna changed the title Don't save dataset info to cache dir when skipping verifications Fix outdated verification_mode values Mar 8, 2023
@lhoestq (Member) left a comment:

Yea the fix for the values is definitely welcome, thank you :)

```python
if len(os.listdir(self._cache_dir)) > 0 and os.path.exists(
    path_join(self._cache_dir, config.DATASET_INFO_FILENAME)
):
    logger.info("Overwrite dataset info from restored data version if exists.")
    self.info = DatasetInfo.from_directory(self._cache_dir)
else:  # dir exists but no data, remove the empty dir as data aren't available anymore
```
Member:

I think we should remove the directory only if it's completely empty, but with your change it would also be removed when the dataset info file is missing. This matters in case some data files are somehow still left.
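The point above, deleting the cache directory only when it is truly empty so stray data files are never lost, can be sketched as follows. The `remove_cache_dir_if_empty` helper is hypothetical, written only to illustrate the review comment, not the library's actual code:

```python
import os
import tempfile


def remove_cache_dir_if_empty(cache_dir: str) -> bool:
    """Remove the cache directory only when it is completely empty.
    Returns True if the directory was removed. A leftover data file
    (even without dataset_info.json) keeps the directory alive."""
    if os.path.isdir(cache_dir) and not os.listdir(cache_dir):
        os.rmdir(cache_dir)  # os.rmdir refuses non-empty dirs anyway
        return True
    return False


tmp = tempfile.mkdtemp()
open(os.path.join(tmp, "leftover.arrow"), "w").close()
print(remove_cache_dir_if_empty(tmp))  # non-empty -> False, kept
os.remove(os.path.join(tmp, "leftover.arrow"))
print(remove_cache_dir_if_empty(tmp))  # empty -> True, removed
print(os.path.exists(tmp))             # False
```

Using `os.rmdir` rather than `shutil.rmtree` adds a second safety net: it raises if the directory unexpectedly still contains files.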

@polinaeterna (Contributor, author):

Thanks for noticing, fixed it. Can I merge?

@lhoestq (Member) left a comment:

Thanks :)

@mariosasko (Collaborator) left a comment:

Good catch!

@polinaeterna polinaeterna merged commit 778d4e1 into huggingface:main Mar 9, 2023
@polinaeterna polinaeterna deleted the do-not-save-info-when-no-checks branch March 9, 2023 17:27
@github-actions bot commented Mar 9, 2023

Show benchmarks

[Benchmark tables: new / old timings with diffs, for PyArrow==8.0.0 and PyArrow==latest, across the suites benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json, and benchmark_map_filter.json. Full per-metric tables omitted.]
