Misc doc improvements #6074

mariosasko · 2023-07-26T12:20:54Z

Removes the warning that defining multiple configurations requires writing a dataset loading script, since the README YAML can be used instead (for simple cases). Also deletes the section about using the BatchSampler in torch<=1.12.1 to speed up loading, as torch 1.12.1 is over a year old (and torch 2.0 has been out for a while).
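
For readers unfamiliar with the README YAML approach mentioned above, here is a minimal sketch of what declaring multiple configurations without a loading script might look like. The config names (`main_data`, `additional_data`), the CSV file names, and the helper functions are hypothetical and chosen only for illustration; the `configs` / `config_name` / `data_files` keys follow the layout described in the `datasets` documentation, but check the docs for the exact options supported by your installed version.

```python
from pathlib import Path
import tempfile

# Hypothetical config and file names, used only for illustration.
CONFIGS = {
    "main_data": "main_data.csv",
    "additional_data": "additional_data.csv",
}


def build_readme_yaml(configs: dict[str, str]) -> str:
    """Return a README YAML header declaring one configuration per entry."""
    lines = ["---", "configs:"]
    for config_name, data_files in configs.items():
        lines.append(f"- config_name: {config_name}")
        lines.append(f'  data_files: "{data_files}"')
    lines.append("---")
    return "\n".join(lines) + "\n"


def write_dataset_repo(root: Path, configs: dict[str, str]) -> Path:
    """Create a tiny local dataset folder: one CSV per configuration plus README.md."""
    for i, data_files in enumerate(configs.values()):
        (root / data_files).write_text(f"text,label\nexample {i},{i}\n")
    readme = root / "README.md"
    readme.write_text(build_readme_yaml(configs))
    return readme


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        readme = write_dataset_repo(Path(tmp), CONFIGS)
        print(readme.read_text())
        # Assumption (not executed here): with a recent `datasets` release
        # installed, each configuration should be loadable by name, e.g.
        #   from datasets import load_dataset
        #   ds = load_dataset(tmp, "main_data")
```

The sketch only generates the files; loading them is left as a commented-out step because it depends on the installed `datasets` version.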

github-actions · 2023-07-26T12:22:52Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006616 / 0.011353 (-0.004737)	0.003915 / 0.011008 (-0.007093)	0.083271 / 0.038508 (0.044763)	0.072595 / 0.023109 (0.049485)	0.307224 / 0.275898 (0.031326)	0.337244 / 0.323480 (0.013764)	0.005296 / 0.007986 (-0.002690)	0.003325 / 0.004328 (-0.001003)	0.064589 / 0.004250 (0.060339)	0.056369 / 0.037052 (0.019316)	0.310829 / 0.258489 (0.052340)	0.345563 / 0.293841 (0.051722)	0.030551 / 0.128546 (-0.097995)	0.008519 / 0.075646 (-0.067127)	0.286368 / 0.419271 (-0.132903)	0.052498 / 0.043533 (0.008966)	0.308735 / 0.255139 (0.053596)	0.329234 / 0.283200 (0.046034)	0.022588 / 0.141683 (-0.119095)	1.453135 / 1.452155 (0.000981)	1.525956 / 1.492716 (0.033239)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.199417 / 0.018006 (0.181410)	0.454621 / 0.000490 (0.454131)	0.004928 / 0.000200 (0.004728)	0.000079 / 0.000054 (0.000025)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.028436 / 0.037411 (-0.008975)	0.083722 / 0.014526 (0.069196)	0.095162 / 0.176557 (-0.081395)	0.153434 / 0.737135 (-0.583702)	0.099480 / 0.296338 (-0.196859)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.384647 / 0.215209 (0.169438)	3.838406 / 2.077655 (1.760751)	1.891267 / 1.504120 (0.387148)	1.751432 / 1.541195 (0.210238)	1.737443 / 1.468490 (0.268953)	0.487758 / 4.584777 (-4.097019)	3.635925 / 3.745712 (-0.109787)	5.208718 / 5.269862 (-0.061144)	3.029374 / 4.565676 (-1.536302)	0.057613 / 0.424275 (-0.366662)	0.007177 / 0.007607 (-0.000430)	0.455596 / 0.226044 (0.229552)	4.559969 / 2.268929 (2.291040)	2.325321 / 55.444624 (-53.119303)	2.034924 / 6.876477 (-4.841552)	2.163869 / 2.142072 (0.021796)	0.583477 / 4.805227 (-4.221750)	0.132870 / 6.500664 (-6.367795)	0.059618 / 0.075469 (-0.015851)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.263751 / 1.841788 (-0.578037)	19.740004 / 8.074308 (11.665696)	14.410980 / 10.191392 (4.219588)	0.170367 / 0.680424 (-0.510057)	0.018225 / 0.534201 (-0.515976)	0.390101 / 0.579283 (-0.189182)	0.404298 / 0.434364 (-0.030066)	0.455295 / 0.540337 (-0.085043)	0.621179 / 1.386936 (-0.765757)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006580 / 0.011353 (-0.004773)	0.004078 / 0.011008 (-0.006930)	0.065842 / 0.038508 (0.027334)	0.074494 / 0.023109 (0.051385)	0.403644 / 0.275898 (0.127746)	0.430204 / 0.323480 (0.106724)	0.005343 / 0.007986 (-0.002643)	0.003366 / 0.004328 (-0.000963)	0.064858 / 0.004250 (0.060607)	0.056252 / 0.037052 (0.019200)	0.412556 / 0.258489 (0.154067)	0.434099 / 0.293841 (0.140258)	0.031518 / 0.128546 (-0.097028)	0.008543 / 0.075646 (-0.067104)	0.071658 / 0.419271 (-0.347613)	0.049962 / 0.043533 (0.006430)	0.398511 / 0.255139 (0.143372)	0.415908 / 0.283200 (0.132708)	0.025011 / 0.141683 (-0.116672)	1.492350 / 1.452155 (0.040195)	1.552996 / 1.492716 (0.060280)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.204971 / 0.018006 (0.186964)	0.439965 / 0.000490 (0.439475)	0.002071 / 0.000200 (0.001872)	0.000084 / 0.000054 (0.000029)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.031673 / 0.037411 (-0.005738)	0.087529 / 0.014526 (0.073004)	0.099882 / 0.176557 (-0.076675)	0.156994 / 0.737135 (-0.580141)	0.101421 / 0.296338 (-0.194918)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.407480 / 0.215209 (0.192271)	4.069123 / 2.077655 (1.991468)	2.081288 / 1.504120 (0.577169)	1.920367 / 1.541195 (0.379172)	1.981053 / 1.468490 (0.512563)	0.481995 / 4.584777 (-4.102782)	3.546486 / 3.745712 (-0.199226)	5.133150 / 5.269862 (-0.136712)	3.056444 / 4.565676 (-1.509232)	0.056650 / 0.424275 (-0.367625)	0.007746 / 0.007607 (0.000139)	0.490891 / 0.226044 (0.264847)	4.902160 / 2.268929 (2.633232)	2.564726 / 55.444624 (-52.879899)	2.234988 / 6.876477 (-4.641489)	2.387656 / 2.142072 (0.245583)	0.576315 / 4.805227 (-4.228912)	0.132065 / 6.500664 (-6.368599)	0.060728 / 0.075469 (-0.014741)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.370568 / 1.841788 (-0.471220)	19.883159 / 8.074308 (11.808851)	14.442066 / 10.191392 (4.250674)	0.150119 / 0.680424 (-0.530305)	0.018359 / 0.534201 (-0.515842)	0.394128 / 0.579283 (-0.185155)	0.411697 / 0.434364 (-0.022667)	0.460580 / 0.540337 (-0.079757)	0.608490 / 1.386936 (-0.778446)

HuggingFaceDocBuilderDev · 2023-07-26T12:27:04Z

The documentation is not available anymore as the PR was closed or merged.

lhoestq

LGTM :)

lhoestq · 2023-07-27T16:15:57Z

merging now if you don't mind - this way I can make a patch release

Misc doc improvements

035d0cf

mariosasko requested a review from stevhliu July 27, 2023 12:21

lhoestq approved these changes Jul 27, 2023

lhoestq merged commit e7008b5 into main Jul 27, 2023

lhoestq deleted the improve-docs branch July 27, 2023 16:16
