Use a new low-memory approach for tf dataset index shuffling #5863

Merged: 16 commits, Jun 8, 2023

Rocketknight1
Member

This PR tries out a new approach to generating the index tensor in to_tf_dataset, which should reduce memory usage for very large datasets. I'll need to do some testing before merging it!

Fixes #5855

@Rocketknight1 Rocketknight1 requested a review from lhoestq May 15, 2023 15:28
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@github-actions

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.007764 / 0.011353 (-0.003588) 0.005397 / 0.011008 (-0.005611) 0.097995 / 0.038508 (0.059487) 0.036360 / 0.023109 (0.013251) 0.312148 / 0.275898 (0.036250) 0.349427 / 0.323480 (0.025947) 0.006635 / 0.007986 (-0.001350) 0.004373 / 0.004328 (0.000044) 0.074350 / 0.004250 (0.070099) 0.054667 / 0.037052 (0.017614) 0.301621 / 0.258489 (0.043132) 0.364233 / 0.293841 (0.070392) 0.035356 / 0.128546 (-0.093191) 0.012512 / 0.075646 (-0.063134) 0.333399 / 0.419271 (-0.085873) 0.051363 / 0.043533 (0.007830) 0.302372 / 0.255139 (0.047233) 0.326542 / 0.283200 (0.043343) 0.118610 / 0.141683 (-0.023073) 1.438485 / 1.452155 (-0.013669) 1.539131 / 1.492716 (0.046415)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.010920 / 0.018006 (-0.007086) 0.561263 / 0.000490 (0.560773) 0.003972 / 0.000200 (0.003772) 0.000096 / 0.000054 (0.000042)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.030333 / 0.037411 (-0.007078) 0.113608 / 0.014526 (0.099083) 0.125802 / 0.176557 (-0.050755) 0.183885 / 0.737135 (-0.553250) 0.130242 / 0.296338 (-0.166097)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.404147 / 0.215209 (0.188938) 4.021990 / 2.077655 (1.944335) 1.821450 / 1.504120 (0.317330) 1.619032 / 1.541195 (0.077837) 1.791267 / 1.468490 (0.322777) 0.706683 / 4.584777 (-3.878094) 3.819056 / 3.745712 (0.073344) 3.485714 / 5.269862 (-1.784147) 1.938968 / 4.565676 (-2.626709) 0.086501 / 0.424275 (-0.337774) 0.012300 / 0.007607 (0.004693) 0.503600 / 0.226044 (0.277555) 5.042123 / 2.268929 (2.773195) 2.269712 / 55.444624 (-53.174912) 1.944912 / 6.876477 (-4.931565) 2.155196 / 2.142072 (0.013123) 0.853434 / 4.805227 (-3.951793) 0.175554 / 6.500664 (-6.325110) 0.072005 / 0.075469 (-0.003464)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.203765 / 1.841788 (-0.638022) 15.836634 / 8.074308 (7.762326) 15.707348 / 10.191392 (5.515956) 0.164828 / 0.680424 (-0.515596) 0.018115 / 0.534201 (-0.516086) 0.434591 / 0.579283 (-0.144692) 0.437858 / 0.434364 (0.003495) 0.524672 / 0.540337 (-0.015665) 0.610535 / 1.386936 (-0.776401)
PyArrow==latest
Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.007558 / 0.011353 (-0.003795) 0.005258 / 0.011008 (-0.005750) 0.075263 / 0.038508 (0.036755) 0.033915 / 0.023109 (0.010805) 0.371368 / 0.275898 (0.095470) 0.399239 / 0.323480 (0.075760) 0.006547 / 0.007986 (-0.001439) 0.004675 / 0.004328 (0.000347) 0.074230 / 0.004250 (0.069980) 0.054653 / 0.037052 (0.017601) 0.376655 / 0.258489 (0.118166) 0.438437 / 0.293841 (0.144596) 0.035838 / 0.128546 (-0.092709) 0.012641 / 0.075646 (-0.063005) 0.087279 / 0.419271 (-0.331993) 0.046311 / 0.043533 (0.002778) 0.356649 / 0.255139 (0.101510) 0.377876 / 0.283200 (0.094677) 0.108097 / 0.141683 (-0.033586) 1.478461 / 1.452155 (0.026306) 1.560375 / 1.492716 (0.067658)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.316384 / 0.018006 (0.298378) 0.539382 / 0.000490 (0.538892) 0.002029 / 0.000200 (0.001829) 0.000090 / 0.000054 (0.000036)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.029950 / 0.037411 (-0.007462) 0.111371 / 0.014526 (0.096846) 0.125254 / 0.176557 (-0.051303) 0.173064 / 0.737135 (-0.564071) 0.130446 / 0.296338 (-0.165893)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.424882 / 0.215209 (0.209673) 4.241575 / 2.077655 (2.163920) 2.096216 / 1.504120 (0.592096) 1.916017 / 1.541195 (0.374823) 2.016318 / 1.468490 (0.547828) 0.701197 / 4.584777 (-3.883580) 3.762365 / 3.745712 (0.016652) 3.307805 / 5.269862 (-1.962057) 1.841752 / 4.565676 (-2.723925) 0.086003 / 0.424275 (-0.338272) 0.012247 / 0.007607 (0.004640) 0.532926 / 0.226044 (0.306882) 5.370509 / 2.268929 (3.101580) 2.587853 / 55.444624 (-52.856772) 2.264541 / 6.876477 (-4.611936) 2.374833 / 2.142072 (0.232760) 0.827751 / 4.805227 (-3.977476) 0.169454 / 6.500664 (-6.331210) 0.066340 / 0.075469 (-0.009129)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.319128 / 1.841788 (-0.522660) 16.702085 / 8.074308 (8.627777) 13.559957 / 10.191392 (3.368565) 0.146659 / 0.680424 (-0.533765) 0.017384 / 0.534201 (-0.516817) 0.421126 / 0.579283 (-0.158157) 0.422067 / 0.434364 (-0.012297) 0.490615 / 0.540337 (-0.049723) 0.587151 / 1.386936 (-0.799785)

@github-actions

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.006604 / 0.011353 (-0.004749) 0.004508 / 0.011008 (-0.006500) 0.098652 / 0.038508 (0.060144) 0.028172 / 0.023109 (0.005063) 0.366997 / 0.275898 (0.091099) 0.403691 / 0.323480 (0.080211) 0.005127 / 0.007986 (-0.002859) 0.003340 / 0.004328 (-0.000989) 0.075408 / 0.004250 (0.071157) 0.038049 / 0.037052 (0.000996) 0.367914 / 0.258489 (0.109425) 0.410958 / 0.293841 (0.117118) 0.030454 / 0.128546 (-0.098093) 0.011422 / 0.075646 (-0.064224) 0.325048 / 0.419271 (-0.094223) 0.042959 / 0.043533 (-0.000574) 0.374536 / 0.255139 (0.119397) 0.394738 / 0.283200 (0.111538) 0.090481 / 0.141683 (-0.051201) 1.504858 / 1.452155 (0.052703) 1.569072 / 1.492716 (0.076356)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.010062 / 0.018006 (-0.007945) 0.408619 / 0.000490 (0.408130) 0.002307 / 0.000200 (0.002107) 0.000070 / 0.000054 (0.000016)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.022898 / 0.037411 (-0.014514) 0.096975 / 0.014526 (0.082449) 0.103032 / 0.176557 (-0.073524) 0.164877 / 0.737135 (-0.572259) 0.107324 / 0.296338 (-0.189014)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.446652 / 0.215209 (0.231442) 4.466939 / 2.077655 (2.389285) 2.204590 / 1.504120 (0.700471) 2.004048 / 1.541195 (0.462853) 2.053035 / 1.468490 (0.584545) 0.696617 / 4.584777 (-3.888160) 3.391173 / 3.745712 (-0.354539) 1.863306 / 5.269862 (-3.406556) 1.160637 / 4.565676 (-3.405039) 0.083115 / 0.424275 (-0.341160) 0.012470 / 0.007607 (0.004862) 0.547207 / 0.226044 (0.321163) 5.500667 / 2.268929 (3.231739) 2.656615 / 55.444624 (-52.788009) 2.313281 / 6.876477 (-4.563195) 2.395632 / 2.142072 (0.253559) 0.815361 / 4.805227 (-3.989867) 0.152112 / 6.500664 (-6.348552) 0.067485 / 0.075469 (-0.007984)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.206975 / 1.841788 (-0.634813) 13.684136 / 8.074308 (5.609828) 13.919129 / 10.191392 (3.727737) 0.140767 / 0.680424 (-0.539657) 0.016445 / 0.534201 (-0.517756) 0.379136 / 0.579283 (-0.200147) 0.385395 / 0.434364 (-0.048969) 0.445781 / 0.540337 (-0.094556) 0.522056 / 1.386936 (-0.864880)
PyArrow==latest
Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.006370 / 0.011353 (-0.004983) 0.004514 / 0.011008 (-0.006495) 0.075671 / 0.038508 (0.037163) 0.026723 / 0.023109 (0.003614) 0.359819 / 0.275898 (0.083921) 0.387935 / 0.323480 (0.064456) 0.004888 / 0.007986 (-0.003098) 0.004619 / 0.004328 (0.000290) 0.075546 / 0.004250 (0.071295) 0.039024 / 0.037052 (0.001971) 0.361173 / 0.258489 (0.102684) 0.411425 / 0.293841 (0.117584) 0.030842 / 0.128546 (-0.097705) 0.011555 / 0.075646 (-0.064091) 0.084697 / 0.419271 (-0.334574) 0.039281 / 0.043533 (-0.004252) 0.370082 / 0.255139 (0.114943) 0.382113 / 0.283200 (0.098913) 0.091237 / 0.141683 (-0.050445) 1.534185 / 1.452155 (0.082030) 1.576488 / 1.492716 (0.083772)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.226568 / 0.018006 (0.208562) 0.401566 / 0.000490 (0.401076) 0.002915 / 0.000200 (0.002715) 0.000076 / 0.000054 (0.000022)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.025357 / 0.037411 (-0.012054) 0.099747 / 0.014526 (0.085221) 0.106443 / 0.176557 (-0.070113) 0.157147 / 0.737135 (-0.579989) 0.110759 / 0.296338 (-0.185580)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.444648 / 0.215209 (0.229439) 4.437930 / 2.077655 (2.360275) 2.154033 / 1.504120 (0.649913) 1.958351 / 1.541195 (0.417157) 1.991031 / 1.468490 (0.522541) 0.691440 / 4.584777 (-3.893337) 3.369087 / 3.745712 (-0.376625) 1.847103 / 5.269862 (-3.422758) 1.152509 / 4.565676 (-3.413168) 0.082519 / 0.424275 (-0.341756) 0.012609 / 0.007607 (0.005001) 0.547267 / 0.226044 (0.321222) 5.501335 / 2.268929 (3.232407) 2.621079 / 55.444624 (-52.823545) 2.281332 / 6.876477 (-4.595145) 2.300427 / 2.142072 (0.158354) 0.803611 / 4.805227 (-4.001616) 0.151784 / 6.500664 (-6.348880) 0.067801 / 0.075469 (-0.007669)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.343201 / 1.841788 (-0.498587) 13.901033 / 8.074308 (5.826725) 13.114738 / 10.191392 (2.923346) 0.149358 / 0.680424 (-0.531066) 0.016596 / 0.534201 (-0.517605) 0.377310 / 0.579283 (-0.201973) 0.387045 / 0.434364 (-0.047319) 0.441272 / 0.540337 (-0.099065) 0.525783 / 1.386936 (-0.861153)

@github-actions

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.008147 / 0.011353 (-0.003205) 0.005531 / 0.011008 (-0.005477) 0.099796 / 0.038508 (0.061288) 0.041574 / 0.023109 (0.018465) 0.315752 / 0.275898 (0.039854) 0.369846 / 0.323480 (0.046366) 0.006489 / 0.007986 (-0.001497) 0.004339 / 0.004328 (0.000010) 0.074769 / 0.004250 (0.070519) 0.051313 / 0.037052 (0.014261) 0.313463 / 0.258489 (0.054974) 0.369918 / 0.293841 (0.076077) 0.035893 / 0.128546 (-0.092653) 0.012487 / 0.075646 (-0.063159) 0.336464 / 0.419271 (-0.082807) 0.052870 / 0.043533 (0.009337) 0.310795 / 0.255139 (0.055656) 0.333146 / 0.283200 (0.049946) 0.112813 / 0.141683 (-0.028870) 1.488192 / 1.452155 (0.036038) 1.563438 / 1.492716 (0.070721)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.015015 / 0.018006 (-0.002991) 0.531783 / 0.000490 (0.531294) 0.005039 / 0.000200 (0.004839) 0.000103 / 0.000054 (0.000049)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.030205 / 0.037411 (-0.007207) 0.115997 / 0.014526 (0.101471) 0.122958 / 0.176557 (-0.053599) 0.186956 / 0.737135 (-0.550180) 0.130268 / 0.296338 (-0.166071)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.402648 / 0.215209 (0.187439) 3.996121 / 2.077655 (1.918466) 1.811715 / 1.504120 (0.307595) 1.640805 / 1.541195 (0.099610) 1.810478 / 1.468490 (0.341988) 0.699996 / 4.584777 (-3.884781) 3.834020 / 3.745712 (0.088308) 3.688364 / 5.269862 (-1.581498) 1.973828 / 4.565676 (-2.591849) 0.087085 / 0.424275 (-0.337190) 0.012501 / 0.007607 (0.004894) 0.498934 / 0.226044 (0.272889) 4.977608 / 2.268929 (2.708680) 2.258678 / 55.444624 (-53.185947) 1.934251 / 6.876477 (-4.942226) 2.177409 / 2.142072 (0.035337) 0.873470 / 4.805227 (-3.931757) 0.173132 / 6.500664 (-6.327532) 0.069144 / 0.075469 (-0.006325)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.181554 / 1.841788 (-0.660234) 15.694468 / 8.074308 (7.620160) 15.026954 / 10.191392 (4.835562) 0.167092 / 0.680424 (-0.513332) 0.017921 / 0.534201 (-0.516280) 0.425649 / 0.579283 (-0.153634) 0.423225 / 0.434364 (-0.011139) 0.522132 / 0.540337 (-0.018205) 0.612806 / 1.386936 (-0.774130)
PyArrow==latest
Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.007896 / 0.011353 (-0.003457) 0.005581 / 0.011008 (-0.005427) 0.076338 / 0.038508 (0.037830) 0.037064 / 0.023109 (0.013954) 0.399706 / 0.275898 (0.123808) 0.431698 / 0.323480 (0.108218) 0.006846 / 0.007986 (-0.001140) 0.006010 / 0.004328 (0.001682) 0.075771 / 0.004250 (0.071520) 0.058214 / 0.037052 (0.021161) 0.395753 / 0.258489 (0.137264) 0.459925 / 0.293841 (0.166084) 0.036349 / 0.128546 (-0.092197) 0.012720 / 0.075646 (-0.062926) 0.087248 / 0.419271 (-0.332024) 0.049405 / 0.043533 (0.005872) 0.387576 / 0.255139 (0.132437) 0.409861 / 0.283200 (0.126661) 0.111639 / 0.141683 (-0.030043) 1.482840 / 1.452155 (0.030685) 1.574465 / 1.492716 (0.081749)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.320628 / 0.018006 (0.302622) 0.556338 / 0.000490 (0.555848) 0.000445 / 0.000200 (0.000245) 0.000060 / 0.000054 (0.000006)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.032905 / 0.037411 (-0.004507) 0.121253 / 0.014526 (0.106727) 0.127241 / 0.176557 (-0.049316) 0.178090 / 0.737135 (-0.559045) 0.143285 / 0.296338 (-0.153054)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.437852 / 0.215209 (0.222643) 4.369770 / 2.077655 (2.292115) 2.219932 / 1.504120 (0.715812) 2.032520 / 1.541195 (0.491325) 2.154300 / 1.468490 (0.685810) 0.678942 / 4.584777 (-3.905835) 3.768148 / 3.745712 (0.022436) 2.152738 / 5.269862 (-3.117124) 1.341480 / 4.565676 (-3.224197) 0.084326 / 0.424275 (-0.339949) 0.012288 / 0.007607 (0.004681) 0.547677 / 0.226044 (0.321633) 5.496777 / 2.268929 (3.227848) 2.702267 / 55.444624 (-52.742357) 2.388580 / 6.876477 (-4.487897) 2.471673 / 2.142072 (0.329601) 0.833645 / 4.805227 (-3.971582) 0.167113 / 6.500664 (-6.333551) 0.067658 / 0.075469 (-0.007811)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.282050 / 1.841788 (-0.559737) 16.413677 / 8.074308 (8.339369) 14.080910 / 10.191392 (3.889518) 0.171782 / 0.680424 (-0.508642) 0.018186 / 0.534201 (-0.516015) 0.425244 / 0.579283 (-0.154039) 0.430260 / 0.434364 (-0.004104) 0.500838 / 0.540337 (-0.039499) 0.591900 / 1.386936 (-0.795036)

@Rocketknight1
Member Author

Rocketknight1 commented May 15, 2023

The approach we take here is to no longer materialize the entire index array or shuffle buffer. Instead, we do the following:

  1. Generate a dataset with tf.data.Dataset.range. This dataset is not materialized - it's basically a range iterator.
  2. When we begin iterating over a dataset, generate a random seed. This value is constant for each pass over the dataset, and is regenerated if we start a new iteration or epoch over the dataset.
  3. Map the range dataset and the random seed with tf.random.experimental.index_shuffle. This converts each index into the value at that position in a permuted array. In other words, tf.random.experimental.index_shuffle(indices, seed, max_index=50_000_000 - 1) is equivalent to np.random.permutation(50_000_000)[indices], but without ever materializing the np.random.permutation(50_000_000) array. Note that max_index is an inclusive upper bound.

Using this approach gives us a complete pass over the dataset that skips no samples, compiles in TF, and never materializes the complete index array, which should avoid the memory usage issues. I'm testing that now!
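
To make the idea in step 3 concrete without depending on TensorFlow, here is a minimal pure-Python sketch of a stateless index permutation. It is not the algorithm TensorFlow uses and not the code in this PR: it swaps in a small Feistel network with cycle walking, which has the same key property (each index is mapped on demand and the full permutation is never stored). The names `permute_index` and `_round_value` are hypothetical, chosen only for this illustration.

```python
import hashlib


def _round_value(value: int, seed: int, round_no: int, bits: int) -> int:
    """Hypothetical round function: hash (value, seed, round) down to `bits` bits."""
    data = f"{value}:{seed}:{round_no}".encode()
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def permute_index(index: int, seed: int, n: int, rounds: int = 4) -> int:
    """Return the image of `index` under a seed-determined permutation of range(n).

    Illustration only: a Feistel network is a bijection on [0, 2**(2*half)),
    and cycle walking restricts it to a bijection on [0, n).
    """
    if not 0 <= index < n:
        raise ValueError("index must satisfy 0 <= index < n")
    # Choose a half-width so that 2 * half bits are enough to represent n - 1.
    half = max(1, ((n - 1).bit_length() + 1) // 2)
    mask = (1 << half) - 1
    x = index
    while True:
        left, right = x >> half, x & mask
        for r in range(rounds):
            # One Feistel round: invertible regardless of the round function.
            left, right = right, left ^ _round_value(right, seed, r, half)
        x = (left << half) | right
        if x < n:
            # Cycle walking: re-apply until the value falls back inside [0, n).
            return x


if __name__ == "__main__":
    n = 10
    epoch_seed = 1234  # would be regenerated at the start of each epoch
    order = [permute_index(i, epoch_seed, n) for i in range(n)]
    print(order)  # some ordering of 0..9, each value appearing exactly once
```

Memory use is constant per lookup, so the same call works for 50 million indices as for 10. Drawing a fresh seed at the start of each pass gives a different ordering per epoch while still visiting every sample exactly once.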

@github-actions

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.008395 / 0.011353 (-0.002958) 0.005893 / 0.011008 (-0.005115) 0.117081 / 0.038508 (0.078573) 0.040987 / 0.023109 (0.017878) 0.394234 / 0.275898 (0.118336) 0.447036 / 0.323480 (0.123556) 0.006703 / 0.007986 (-0.001283) 0.006085 / 0.004328 (0.001757) 0.086479 / 0.004250 (0.082228) 0.050192 / 0.037052 (0.013140) 0.400958 / 0.258489 (0.142469) 0.455551 / 0.293841 (0.161710) 0.041481 / 0.128546 (-0.087065) 0.014135 / 0.075646 (-0.061511) 0.399929 / 0.419271 (-0.019343) 0.060824 / 0.043533 (0.017291) 0.395946 / 0.255139 (0.140807) 0.428811 / 0.283200 (0.145611) 0.120057 / 0.141683 (-0.021626) 1.703244 / 1.452155 (0.251090) 1.841153 / 1.492716 (0.348436)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.021826 / 0.018006 (0.003820) 0.494279 / 0.000490 (0.493789) 0.011258 / 0.000200 (0.011058) 0.000382 / 0.000054 (0.000328)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.031651 / 0.037411 (-0.005760) 0.132871 / 0.014526 (0.118345) 0.137388 / 0.176557 (-0.039169) 0.205808 / 0.737135 (-0.531327) 0.147585 / 0.296338 (-0.148753)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.474483 / 0.215209 (0.259274) 4.726568 / 2.077655 (2.648914) 2.136172 / 1.504120 (0.632052) 1.918364 / 1.541195 (0.377169) 2.068794 / 1.468490 (0.600304) 0.836481 / 4.584777 (-3.748296) 4.550583 / 3.745712 (0.804871) 2.456287 / 5.269862 (-2.813574) 1.563127 / 4.565676 (-3.002550) 0.102541 / 0.424275 (-0.321734) 0.014492 / 0.007607 (0.006885) 0.598572 / 0.226044 (0.372528) 5.953321 / 2.268929 (3.684392) 2.695210 / 55.444624 (-52.749414) 2.294317 / 6.876477 (-4.582160) 2.456585 / 2.142072 (0.314513) 1.019907 / 4.805227 (-3.785320) 0.201225 / 6.500664 (-6.299439) 0.077113 / 0.075469 (0.001644)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.497662 / 1.841788 (-0.344126) 18.216941 / 8.074308 (10.142633) 17.016638 / 10.191392 (6.825246) 0.193271 / 0.680424 (-0.487153) 0.020440 / 0.534201 (-0.513761) 0.509361 / 0.579283 (-0.069922) 0.513389 / 0.434364 (0.079025) 0.622266 / 0.540337 (0.081928) 0.741733 / 1.386936 (-0.645203)
PyArrow==latest
Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.008641 / 0.011353 (-0.002712) 0.005792 / 0.011008 (-0.005216) 0.086020 / 0.038508 (0.047512) 0.040005 / 0.023109 (0.016896) 0.435120 / 0.275898 (0.159222) 0.480269 / 0.323480 (0.156789) 0.006669 / 0.007986 (-0.001317) 0.006039 / 0.004328 (0.001711) 0.083468 / 0.004250 (0.079218) 0.057700 / 0.037052 (0.020648) 0.416418 / 0.258489 (0.157929) 0.508286 / 0.293841 (0.214445) 0.041198 / 0.128546 (-0.087349) 0.014346 / 0.075646 (-0.061301) 0.100553 / 0.419271 (-0.318718) 0.054201 / 0.043533 (0.010668) 0.438232 / 0.255139 (0.183093) 0.454707 / 0.283200 (0.171508) 0.118332 / 0.141683 (-0.023351) 1.657607 / 1.452155 (0.205452) 1.825510 / 1.492716 (0.332794)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.236156 / 0.018006 (0.218150) 0.487612 / 0.000490 (0.487123) 0.005747 / 0.000200 (0.005547) 0.000111 / 0.000054 (0.000057)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.035127 / 0.037411 (-0.002284) 0.132013 / 0.014526 (0.117487) 0.142316 / 0.176557 (-0.034241) 0.198627 / 0.737135 (-0.538508) 0.145454 / 0.296338 (-0.150885)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.513041 / 0.215209 (0.297832) 5.066197 / 2.077655 (2.988542) 2.508779 / 1.504120 (1.004659) 2.273901 / 1.541195 (0.732706) 2.364958 / 1.468490 (0.896468) 0.811367 / 4.584777 (-3.773410) 4.504744 / 3.745712 (0.759032) 2.499811 / 5.269862 (-2.770050) 1.583349 / 4.565676 (-2.982328) 0.101701 / 0.424275 (-0.322574) 0.014379 / 0.007607 (0.006772) 0.669506 / 0.226044 (0.443462) 6.556702 / 2.268929 (4.287774) 3.123457 / 55.444624 (-52.321167) 2.731997 / 6.876477 (-4.144480) 2.862866 / 2.142072 (0.720794) 0.992956 / 4.805227 (-3.812271) 0.200473 / 6.500664 (-6.300191) 0.078780 / 0.075469 (0.003311)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.540718 / 1.841788 (-0.301070) 18.749344 / 8.074308 (10.675036) 15.648983 / 10.191392 (5.457591) 0.174089 / 0.680424 (-0.506335) 0.020441 / 0.534201 (-0.513760) 0.503742 / 0.579283 (-0.075541) 0.500648 / 0.434364 (0.066284) 0.598558 / 0.540337 (0.058221) 0.712093 / 1.386936 (-0.674843)

@github-actions

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.009940 / 0.011353 (-0.001412) 0.006193 / 0.011008 (-0.004815) 0.125874 / 0.038508 (0.087366) 0.038664 / 0.023109 (0.015555) 0.380013 / 0.275898 (0.104115) 0.430152 / 0.323480 (0.106672) 0.006961 / 0.007986 (-0.001025) 0.004749 / 0.004328 (0.000420) 0.099743 / 0.004250 (0.095492) 0.052349 / 0.037052 (0.015297) 0.433354 / 0.258489 (0.174865) 0.436273 / 0.293841 (0.142433) 0.053929 / 0.128546 (-0.074617) 0.019369 / 0.075646 (-0.056278) 0.421783 / 0.419271 (0.002511) 0.062746 / 0.043533 (0.019213) 0.377225 / 0.255139 (0.122086) 0.413708 / 0.283200 (0.130508) 0.111371 / 0.141683 (-0.030312) 1.819166 / 1.452155 (0.367011) 1.974527 / 1.492716 (0.481810)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.090664 / 0.018006 (0.072658) 0.566166 / 0.000490 (0.565676) 0.079305 / 0.000200 (0.079105) 0.000755 / 0.000054 (0.000700)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.029720 / 0.037411 (-0.007691) 0.126030 / 0.014526 (0.111504) 0.146020 / 0.176557 (-0.030537) 0.210354 / 0.737135 (-0.526781) 0.149428 / 0.296338 (-0.146910)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.624371 / 0.215209 (0.409162) 6.332839 / 2.077655 (4.255184) 2.547784 / 1.504120 (1.043664) 2.150508 / 1.541195 (0.609313) 2.240816 / 1.468490 (0.772326) 1.271131 / 4.584777 (-3.313646) 5.642726 / 3.745712 (1.897014) 3.212988 / 5.269862 (-2.056874) 2.258123 / 4.565676 (-2.307553) 0.149477 / 0.424275 (-0.274798) 0.014603 / 0.007607 (0.006996) 0.782155 / 0.226044 (0.556111) 7.855191 / 2.268929 (5.586262) 3.308638 / 55.444624 (-52.135986) 2.548142 / 6.876477 (-4.328335) 2.627374 / 2.142072 (0.485301) 1.515170 / 4.805227 (-3.290058) 0.262479 / 6.500664 (-6.238185) 0.082181 / 0.075469 (0.006712)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.573169 / 1.841788 (-0.268618) 18.105719 / 8.074308 (10.031411) 22.015179 / 10.191392 (11.823787) 0.254678 / 0.680424 (-0.425746) 0.027098 / 0.534201 (-0.507103) 0.578045 / 0.579283 (-0.001238) 0.647130 / 0.434364 (0.212766) 0.650522 / 0.540337 (0.110185) 0.797713 / 1.386936 (-0.589223)
PyArrow==latest
Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.010376 / 0.011353 (-0.000977) 0.005990 / 0.011008 (-0.005018) 0.097144 / 0.038508 (0.058635) 0.038205 / 0.023109 (0.015096) 0.468347 / 0.275898 (0.192449) 0.497646 / 0.323480 (0.174166) 0.006916 / 0.007986 (-0.001069) 0.004760 / 0.004328 (0.000431) 0.109838 / 0.004250 (0.105587) 0.048321 / 0.037052 (0.011269) 0.437458 / 0.258489 (0.178969) 0.534864 / 0.293841 (0.241023) 0.053655 / 0.128546 (-0.074892) 0.021915 / 0.075646 (-0.053732) 0.121047 / 0.419271 (-0.298224) 0.059694 / 0.043533 (0.016162) 0.466937 / 0.255139 (0.211798) 0.482030 / 0.283200 (0.198831) 0.117458 / 0.141683 (-0.024225) 1.835551 / 1.452155 (0.383396) 1.965748 / 1.492716 (0.473031)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.234885 / 0.018006 (0.216879) 0.529925 / 0.000490 (0.529436) 0.000484 / 0.000200 (0.000284) 0.000085 / 0.000054 (0.000031)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.030959 / 0.037411 (-0.006453) 0.128905 / 0.014526 (0.114379) 0.136913 / 0.176557 (-0.039643) 0.195133 / 0.737135 (-0.542002) 0.147929 / 0.296338 (-0.148410)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.715661 / 0.215209 (0.500451) 6.994125 / 2.077655 (4.916470) 3.033178 / 1.504120 (1.529058) 2.663709 / 1.541195 (1.122515) 2.707558 / 1.468490 (1.239068) 1.316195 / 4.584777 (-3.268582) 5.688264 / 3.745712 (1.942552) 3.260897 / 5.269862 (-2.008964) 2.134985 / 4.565676 (-2.430691) 0.153945 / 0.424275 (-0.270330) 0.014727 / 0.007607 (0.007119) 0.911339 / 0.226044 (0.685294) 8.902640 / 2.268929 (6.633711) 3.806606 / 55.444624 (-51.638018) 3.052238 / 6.876477 (-3.824238) 3.046945 / 2.142072 (0.904873) 1.559837 / 4.805227 (-3.245390) 0.272276 / 6.500664 (-6.228388) 0.087728 / 0.075469 (0.012259)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.712691 / 1.841788 (-0.129097) 18.127575 / 8.074308 (10.053267) 19.734063 / 10.191392 (9.542671) 0.235006 / 0.680424 (-0.445418) 0.027581 / 0.534201 (-0.506620) 0.551080 / 0.579283 (-0.028203) 0.608564 / 0.434364 (0.174200) 0.636578 / 0.540337 (0.096241) 0.732374 / 1.386936 (-0.654562)

@Rocketknight1
Member Author

Looks good in testing - this should be ready for review! cc @lhoestq @massquantity

@massquantity

massquantity commented May 16, 2023

Looks good to me, though I suspect very few people will upgrade to TF >= 2.9 unless their memory is full :)

@lhoestq
Member

lhoestq commented May 16, 2023

Is it more efficient than using NumPy to shuffle, as in the multiprocessing path? Why not use the same strategy?

@Rocketknight1
Member Author

Good question, honestly! The NumPy strategy works fine, but requires us to handle multiple processes instead of doing everything in tf.data. We could just scrap this entire code path and always use the multiprocessing NumPy approach, but I think single-threaded throughput would be lower if we did that. If you prefer it for code simplicity, though, I can do that.

In the longer term, I'm hoping that tf.data gets native support for our data structures and we can transition the whole pipeline to pure tf.data, but that still hasn't happened 🫠

@Rocketknight1
Member Author

And @massquantity, TF 2.13 is going to be released in a couple of days, so I hope most users are at least on TF 2.9 by now!

@lhoestq
Member

lhoestq commented May 16, 2023

Unless there is a big gap in performance I think code simplicity would be appreciated ^^

@github-actions

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.008638 / 0.011353 (-0.002715) 0.006013 / 0.011008 (-0.004995) 0.116456 / 0.038508 (0.077948) 0.040419 / 0.023109 (0.017310) 0.418374 / 0.275898 (0.142476) 0.447693 / 0.323480 (0.124213) 0.007002 / 0.007986 (-0.000984) 0.006175 / 0.004328 (0.001847) 0.087801 / 0.004250 (0.083550) 0.051980 / 0.037052 (0.014928) 0.393275 / 0.258489 (0.134786) 0.449601 / 0.293841 (0.155760) 0.041670 / 0.128546 (-0.086876) 0.014396 / 0.075646 (-0.061251) 0.399175 / 0.419271 (-0.020096) 0.060635 / 0.043533 (0.017102) 0.391449 / 0.255139 (0.136310) 0.420713 / 0.283200 (0.137513) 0.121369 / 0.141683 (-0.020314) 1.692630 / 1.452155 (0.240475) 1.815526 / 1.492716 (0.322810)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.244321 / 0.018006 (0.226315) 0.487947 / 0.000490 (0.487458) 0.004563 / 0.000200 (0.004363) 0.000116 / 0.000054 (0.000061)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.033425 / 0.037411 (-0.003987) 0.134458 / 0.014526 (0.119932) 0.138810 / 0.176557 (-0.037746) 0.208871 / 0.737135 (-0.528264) 0.147964 / 0.296338 (-0.148374)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.483347 / 0.215209 (0.268138) 4.799550 / 2.077655 (2.721895) 2.174149 / 1.504120 (0.670029) 1.943276 / 1.541195 (0.402081) 2.010884 / 1.468490 (0.542394) 0.832030 / 4.584777 (-3.752747) 4.716713 / 3.745712 (0.971001) 4.615810 / 5.269862 (-0.654052) 2.379600 / 4.565676 (-2.186077) 0.103560 / 0.424275 (-0.320715) 0.014683 / 0.007607 (0.007076) 0.598558 / 0.226044 (0.372514) 5.999126 / 2.268929 (3.730197) 2.677819 / 55.444624 (-52.766805) 2.320838 / 6.876477 (-4.555639) 2.503684 / 2.142072 (0.361611) 1.016459 / 4.805227 (-3.788769) 0.201672 / 6.500664 (-6.298992) 0.079310 / 0.075469 (0.003841)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.446374 / 1.841788 (-0.395413) 19.219310 / 8.074308 (11.145002) 17.294665 / 10.191392 (7.103273) 0.246115 / 0.680424 (-0.434309) 0.021406 / 0.534201 (-0.512795) 0.524084 / 0.579283 (-0.055200) 0.511254 / 0.434364 (0.076890) 0.621304 / 0.540337 (0.080966) 0.727088 / 1.386936 (-0.659848)
PyArrow==latest
Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.008907 / 0.011353 (-0.002446) 0.006165 / 0.011008 (-0.004843) 0.090786 / 0.038508 (0.052278) 0.040893 / 0.023109 (0.017784) 0.451252 / 0.275898 (0.175354) 0.477811 / 0.323480 (0.154331) 0.007418 / 0.007986 (-0.000568) 0.005789 / 0.004328 (0.001461) 0.087422 / 0.004250 (0.083171) 0.061800 / 0.037052 (0.024748) 0.459085 / 0.258489 (0.200596) 0.488897 / 0.293841 (0.195056) 0.048157 / 0.128546 (-0.080389) 0.014676 / 0.075646 (-0.060970) 0.104372 / 0.419271 (-0.314900) 0.058066 / 0.043533 (0.014534) 0.446131 / 0.255139 (0.190992) 0.460428 / 0.283200 (0.177228) 0.128492 / 0.141683 (-0.013191) 1.811419 / 1.452155 (0.359265) 1.894781 / 1.492716 (0.402064)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.220527 / 0.018006 (0.202520) 0.487663 / 0.000490 (0.487173) 0.003864 / 0.000200 (0.003664) 0.000162 / 0.000054 (0.000107)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.036354 / 0.037411 (-0.001057) 0.140469 / 0.014526 (0.125944) 0.149990 / 0.176557 (-0.026566) 0.212369 / 0.737135 (-0.524766) 0.154000 / 0.296338 (-0.142338)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.514172 / 0.215209 (0.298963) 5.129247 / 2.077655 (3.051593) 2.536773 / 1.504120 (1.032653) 2.317253 / 1.541195 (0.776058) 2.424066 / 1.468490 (0.955576) 0.836160 / 4.584777 (-3.748617) 4.906235 / 3.745712 (1.160523) 4.431395 / 5.269862 (-0.838467) 2.332845 / 4.565676 (-2.232831) 0.102867 / 0.424275 (-0.321409) 0.014851 / 0.007607 (0.007244) 0.644104 / 0.226044 (0.418060) 6.415847 / 2.268929 (4.146918) 3.186984 / 55.444624 (-52.257641) 2.774125 / 6.876477 (-4.102352) 2.848045 / 2.142072 (0.705972) 1.018757 / 4.805227 (-3.786470) 0.212333 / 6.500664 (-6.288331) 0.079405 / 0.075469 (0.003936)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.748375 / 1.841788 (-0.093412) 19.733829 / 8.074308 (11.659521) 15.766665 / 10.191392 (5.575273) 0.192087 / 0.680424 (-0.488337) 0.027641 / 0.534201 (-0.506560) 0.504101 / 0.579283 (-0.075182) 0.493815 / 0.434364 (0.059451) 0.583247 / 0.540337 (0.042910) 0.697432 / 1.386936 (-0.689504)

@Rocketknight1
Member Author

Rocketknight1 commented May 16, 2023

Hi @lhoestq, I tried moving everything to the NumPy path but ran into issues - the SharedMemory constructs it depends on were only added in Python 3.8. As a result, if we move everything to that path then to_tf_dataset does not work on older Python versions.
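To make the Python 3.8 constraint concrete, here is a minimal sketch of sharing a shuffled index array through `multiprocessing.shared_memory`. It is an illustration under stated assumptions, not the code in this PR: the helper names `publish_shuffled_indices` and `read_indices` are hypothetical, plain `memoryview` casts stand in for NumPy arrays to keep it dependency-free, and both sides run in one process here, whereas a real worker would attach by name from another process.

```python
import random
import sys


def publish_shuffled_indices(n, seed=0):
    """Write a shuffled permutation of range(n) into a new shared-memory block.

    Hypothetical helper (n must be >= 1). Returns the SharedMemory handle,
    which the caller must later close() and unlink().
    """
    if sys.version_info < (3, 8):
        # The shared_memory module does not exist before Python 3.8.
        raise RuntimeError("multiprocessing.shared_memory requires Python 3.8+")
    from multiprocessing import shared_memory

    indices = list(range(n))
    random.Random(seed).shuffle(indices)

    # Reserve 8 bytes per index and view the buffer as signed 64-bit integers.
    shm = shared_memory.SharedMemory(create=True, size=n * 8)
    with shm.buf.cast("q") as view:
        for position, index in enumerate(indices):
            view[position] = index
    return shm


def read_indices(name, n):
    """Attach to an existing block by name and copy out the first n indices."""
    from multiprocessing import shared_memory

    shm = shared_memory.SharedMemory(name=name)
    try:
        with shm.buf.cast("q") as view:
            return [view[i] for i in range(n)]
    finally:
        # Detach only; the creator is responsible for unlink().
        shm.close()


shm = publish_shuffled_indices(10, seed=42)
try:
    order = read_indices(shm.name, 10)
finally:
    shm.close()
    shm.unlink()
```

On an interpreter older than 3.8 the import inside the helper cannot succeed, which is the reason a path built on these constructs cannot serve as the only code path.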

For now, how do you feel about reverting and using my original solution, which has fallbacks for all versions of Python and TensorFlow? Once our minimum versions pass Python 3.8 or TF 2.9 we can remove the older code paths.
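The fallback idea can be sketched as a small version-gated dispatch. This is a hedged illustration only: the actual branching in `to_tf_dataset` is not shown in this thread, so the function names and the three strategy labels below are hypothetical, and the version parsing assumes a plain numeric `major.minor` prefix.

```python
import sys


def parse_major_minor(version_text):
    """Return (major, minor) from a version string such as '2.13.0'."""
    major, minor = version_text.split(".")[:2]
    return int(major), int(minor)


def choose_index_strategy(tf_version, num_workers=0, python_version=None):
    """Pick an index-generation strategy label (labels are hypothetical)."""
    if python_version is None:
        python_version = sys.version_info[:2]

    if num_workers > 0:
        # The multiprocessing path relies on SharedMemory (Python 3.8+).
        if python_version < (3, 8):
            raise RuntimeError("multiprocessing path needs Python 3.8+")
        return "numpy_shared_memory"

    if parse_major_minor(tf_version) >= (2, 9):
        # Newer TensorFlow: low-memory shuffling inside tf.data.
        return "tf_low_memory_shuffle"

    # Older TensorFlow: fall back to materializing the full index tensor.
    return "tf_full_index_tensor"
```

Once the minimum supported versions pass the thresholds mentioned above, the corresponding fallback branches could be deleted without touching the others.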

@Rocketknight1
Member Author

Gentle ping on this question @lhoestq!

@lhoestq
Member

lhoestq commented May 19, 2023

Ah yes, indeed. Feel free to revert and add comments explaining why you needed a different approach for the single-process case.

@github-actions

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.008395 / 0.011353 (-0.002958) 0.005773 / 0.011008 (-0.005235) 0.115702 / 0.038508 (0.077194) 0.039897 / 0.023109 (0.016788) 0.483140 / 0.275898 (0.207242) 0.531288 / 0.323480 (0.207808) 0.006739 / 0.007986 (-0.001246) 0.004419 / 0.004328 (0.000090) 0.086374 / 0.004250 (0.082124) 0.056498 / 0.037052 (0.019446) 0.491589 / 0.258489 (0.233100) 0.556366 / 0.293841 (0.262525) 0.041366 / 0.128546 (-0.087181) 0.014373 / 0.075646 (-0.061274) 0.395504 / 0.419271 (-0.023767) 0.094382 / 0.043533 (0.050849) 0.483000 / 0.255139 (0.227861) 0.522693 / 0.283200 (0.239494) 0.138804 / 0.141683 (-0.002879) 1.719563 / 1.452155 (0.267409) 1.853470 / 1.492716 (0.360753)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.235616 / 0.018006 (0.217610) 0.483267 / 0.000490 (0.482777) 0.008663 / 0.000200 (0.008463) 0.000401 / 0.000054 (0.000347)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.033124 / 0.037411 (-0.004287) 0.128821 / 0.014526 (0.114295) 0.138910 / 0.176557 (-0.037647) 0.213570 / 0.737135 (-0.523566) 0.146646 / 0.296338 (-0.149693)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.479998 / 0.215209 (0.264789) 4.772325 / 2.077655 (2.694670) 2.228424 / 1.504120 (0.724304) 2.000915 / 1.541195 (0.459721) 2.105799 / 1.468490 (0.637309) 0.824235 / 4.584777 (-3.760542) 4.511902 / 3.745712 (0.766189) 4.723073 / 5.269862 (-0.546789) 2.333442 / 4.565676 (-2.232235) 0.101161 / 0.424275 (-0.323114) 0.014403 / 0.007607 (0.006796) 0.596395 / 0.226044 (0.370351) 5.961046 / 2.268929 (3.692117) 2.746679 / 55.444624 (-52.697946) 2.352085 / 6.876477 (-4.524392) 2.609812 / 2.142072 (0.467740) 0.996950 / 4.805227 (-3.808277) 0.197923 / 6.500664 (-6.302741) 0.075546 / 0.075469 (0.000077)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.529896 / 1.841788 (-0.311892) 18.183887 / 8.074308 (10.109578) 16.352332 / 10.191392 (6.160940) 0.213504 / 0.680424 (-0.466920) 0.020388 / 0.534201 (-0.513813) 0.497832 / 0.579283 (-0.081451) 0.495477 / 0.434364 (0.061113) 0.585984 / 0.540337 (0.045647) 0.688726 / 1.386936 (-0.698210)
PyArrow==latest
Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.008422 / 0.011353 (-0.002931) 0.005876 / 0.011008 (-0.005132) 0.089310 / 0.038508 (0.050802) 0.039769 / 0.023109 (0.016660) 0.425279 / 0.275898 (0.149381) 0.470818 / 0.323480 (0.147338) 0.006519 / 0.007986 (-0.001467) 0.006276 / 0.004328 (0.001948) 0.085753 / 0.004250 (0.081503) 0.053867 / 0.037052 (0.016815) 0.429193 / 0.258489 (0.170704) 0.480278 / 0.293841 (0.186437) 0.040657 / 0.128546 (-0.087889) 0.014055 / 0.075646 (-0.061591) 0.101422 / 0.419271 (-0.317849) 0.053803 / 0.043533 (0.010271) 0.428348 / 0.255139 (0.173209) 0.452193 / 0.283200 (0.168994) 0.124914 / 0.141683 (-0.016769) 1.750122 / 1.452155 (0.297968) 1.850875 / 1.492716 (0.358159)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.249958 / 0.018006 (0.231952) 0.485183 / 0.000490 (0.484694) 0.000472 / 0.000200 (0.000272) 0.000069 / 0.000054 (0.000015)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.034563 / 0.037411 (-0.002848) 0.135565 / 0.014526 (0.121039) 0.143271 / 0.176557 (-0.033285) 0.199080 / 0.737135 (-0.538056) 0.149336 / 0.296338 (-0.147003)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.526170 / 0.215209 (0.310961) 5.270960 / 2.077655 (3.193305) 2.664585 / 1.504120 (1.160465) 2.440027 / 1.541195 (0.898832) 2.612764 / 1.468490 (1.144274) 0.828965 / 4.584777 (-3.755812) 4.769983 / 3.745712 (1.024271) 2.441962 / 5.269862 (-2.827900) 1.549032 / 4.565676 (-3.016644) 0.100851 / 0.424275 (-0.323424) 0.014425 / 0.007607 (0.006818) 0.640908 / 0.226044 (0.414864) 6.399041 / 2.268929 (4.130113) 3.242424 / 55.444624 (-52.202200) 2.836317 / 6.876477 (-4.040160) 2.933010 / 2.142072 (0.790938) 1.002277 / 4.805227 (-3.802950) 0.201247 / 6.500664 (-6.299417) 0.078777 / 0.075469 (0.003308)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.620415 / 1.841788 (-0.221373) 19.153631 / 8.074308 (11.079323) 16.744068 / 10.191392 (6.552676) 0.167327 / 0.680424 (-0.513097) 0.020186 / 0.534201 (-0.514015) 0.503683 / 0.579283 (-0.075600) 0.500051 / 0.434364 (0.065687) 0.587188 / 0.540337 (0.046850) 0.699975 / 1.386936 (-0.686961)

@Rocketknight1
Member Author

This is probably ready, but likely conflicts with #5883. I'll wait for that PR to be merged and then rebase and merge this one.

@github-actions

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.008387 / 0.011353 (-0.002965) 0.005824 / 0.011008 (-0.005184) 0.117721 / 0.038508 (0.079213) 0.040420 / 0.023109 (0.017311) 0.404961 / 0.275898 (0.129063) 0.426695 / 0.323480 (0.103215) 0.006634 / 0.007986 (-0.001352) 0.006033 / 0.004328 (0.001705) 0.088652 / 0.004250 (0.084402) 0.048075 / 0.037052 (0.011022) 0.400683 / 0.258489 (0.142194) 0.432489 / 0.293841 (0.138648) 0.042065 / 0.128546 (-0.086482) 0.014071 / 0.075646 (-0.061575) 0.399398 / 0.419271 (-0.019873) 0.066034 / 0.043533 (0.022501) 0.400056 / 0.255139 (0.144918) 0.421130 / 0.283200 (0.137930) 0.119721 / 0.141683 (-0.021962) 1.752166 / 1.452155 (0.300011) 1.820161 / 1.492716 (0.327444)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.244264 / 0.018006 (0.226258) 0.480882 / 0.000490 (0.480392) 0.005604 / 0.000200 (0.005404) 0.000175 / 0.000054 (0.000121)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.032397 / 0.037411 (-0.005015) 0.131632 / 0.014526 (0.117106) 0.139765 / 0.176557 (-0.036792) 0.213135 / 0.737135 (-0.524000) 0.147891 / 0.296338 (-0.148447)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.474534 / 0.215209 (0.259325) 4.730424 / 2.077655 (2.652770) 2.163706 / 1.504120 (0.659586) 1.936051 / 1.541195 (0.394857) 2.012185 / 1.468490 (0.543695) 0.826583 / 4.584777 (-3.758194) 4.921494 / 3.745712 (1.175782) 2.431401 / 5.269862 (-2.838460) 1.566020 / 4.565676 (-2.999656) 0.101255 / 0.424275 (-0.323020) 0.014553 / 0.007607 (0.006946) 0.608301 / 0.226044 (0.382256) 6.089801 / 2.268929 (3.820873) 2.691986 / 55.444624 (-52.752638) 2.296498 / 6.876477 (-4.579979) 2.455388 / 2.142072 (0.313315) 0.984342 / 4.805227 (-3.820885) 0.200447 / 6.500664 (-6.300217) 0.077602 / 0.075469 (0.002133)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.445067 / 1.841788 (-0.396721) 18.588670 / 8.074308 (10.514362) 16.950216 / 10.191392 (6.758824) 0.169688 / 0.680424 (-0.510736) 0.020544 / 0.534201 (-0.513657) 0.508506 / 0.579283 (-0.070777) 0.516218 / 0.434364 (0.081854) 0.646072 / 0.540337 (0.105734) 0.763227 / 1.386936 (-0.623709)
PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.008816 / 0.011353 (-0.002537) 0.006016 / 0.011008 (-0.004992) 0.090946 / 0.038508 (0.052438) 0.040189 / 0.023109 (0.017080) 0.446723 / 0.275898 (0.170825) 0.494633 / 0.323480 (0.171153) 0.007206 / 0.007986 (-0.000779) 0.004508 / 0.004328 (0.000180) 0.088477 / 0.004250 (0.084226) 0.055587 / 0.037052 (0.018535) 0.445349 / 0.258489 (0.186860) 0.504940 / 0.293841 (0.211099) 0.041976 / 0.128546 (-0.086570) 0.014296 / 0.075646 (-0.061351) 0.102835 / 0.419271 (-0.316436) 0.054786 / 0.043533 (0.011253) 0.444789 / 0.255139 (0.189651) 0.472306 / 0.283200 (0.189106) 0.123365 / 0.141683 (-0.018318) 1.725803 / 1.452155 (0.273648) 1.832216 / 1.492716 (0.339500)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.252680 / 0.018006 (0.234674) 0.476719 / 0.000490 (0.476229) 0.000461 / 0.000200 (0.000261) 0.000067 / 0.000054 (0.000013)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.035961 / 0.037411 (-0.001450) 0.135399 / 0.014526 (0.120873) 0.147549 / 0.176557 (-0.029007) 0.207468 / 0.737135 (-0.529667) 0.151591 / 0.296338 (-0.144747)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.528143 / 0.215209 (0.312934) 5.270766 / 2.077655 (3.193111) 2.675644 / 1.504120 (1.171524) 2.472855 / 1.541195 (0.931660) 2.636020 / 1.468490 (1.167530) 0.841325 / 4.584777 (-3.743452) 4.702290 / 3.745712 (0.956578) 2.523537 / 5.269862 (-2.746325) 1.595617 / 4.565676 (-2.970059) 0.102095 / 0.424275 (-0.322180) 0.014568 / 0.007607 (0.006961) 0.652090 / 0.226044 (0.426046) 6.503086 / 2.268929 (4.234158) 3.277025 / 55.444624 (-52.167599) 2.931264 / 6.876477 (-3.945213) 3.021667 / 2.142072 (0.879594) 1.002560 / 4.805227 (-3.802668) 0.202621 / 6.500664 (-6.298043) 0.080583 / 0.075469 (0.005114)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.639281 / 1.841788 (-0.202507) 18.911529 / 8.074308 (10.837220) 17.082795 / 10.191392 (6.891403) 0.179456 / 0.680424 (-0.500968) 0.021740 / 0.534201 (-0.512460) 0.526426 / 0.579283 (-0.052857) 0.535083 / 0.434364 (0.100719) 0.583304 / 0.540337 (0.042967) 0.696733 / 1.386936 (-0.690203)

@github-actions

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.006823 / 0.011353 (-0.004530) 0.004847 / 0.011008 (-0.006161) 0.096038 / 0.038508 (0.057530) 0.033037 / 0.023109 (0.009928) 0.298379 / 0.275898 (0.022481) 0.333319 / 0.323480 (0.009839) 0.005343 / 0.007986 (-0.002643) 0.003863 / 0.004328 (-0.000465) 0.072928 / 0.004250 (0.068678) 0.040898 / 0.037052 (0.003846) 0.303116 / 0.258489 (0.044627) 0.334021 / 0.293841 (0.040181) 0.034780 / 0.128546 (-0.093767) 0.011978 / 0.075646 (-0.063668) 0.331642 / 0.419271 (-0.087629) 0.052729 / 0.043533 (0.009196) 0.298586 / 0.255139 (0.043447) 0.319296 / 0.283200 (0.036097) 0.097711 / 0.141683 (-0.043972) 1.416899 / 1.452155 (-0.035256) 1.546008 / 1.492716 (0.053292)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.234303 / 0.018006 (0.216296) 0.492767 / 0.000490 (0.492278) 0.004935 / 0.000200 (0.004736) 0.000106 / 0.000054 (0.000051)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.030617 / 0.037411 (-0.006795) 0.121203 / 0.014526 (0.106677) 0.126677 / 0.176557 (-0.049879) 0.186379 / 0.737135 (-0.550756) 0.129849 / 0.296338 (-0.166490)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.416324 / 0.215209 (0.201115) 4.135563 / 2.077655 (2.057908) 1.976182 / 1.504120 (0.472062) 1.807611 / 1.541195 (0.266416) 1.886282 / 1.468490 (0.417792) 0.713006 / 4.584777 (-3.871771) 3.899205 / 3.745712 (0.153493) 2.283427 / 5.269862 (-2.986435) 1.543088 / 4.565676 (-3.022589) 0.086189 / 0.424275 (-0.338087) 0.012908 / 0.007607 (0.005301) 0.516156 / 0.226044 (0.290112) 5.144199 / 2.268929 (2.875271) 2.460142 / 55.444624 (-52.984482) 2.209054 / 6.876477 (-4.667423) 2.325277 / 2.142072 (0.183204) 0.849890 / 4.805227 (-3.955337) 0.173687 / 6.500664 (-6.326977) 0.070178 / 0.075469 (-0.005291)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.241790 / 1.841788 (-0.599997) 16.047257 / 8.074308 (7.972949) 15.774146 / 10.191392 (5.582754) 0.145871 / 0.680424 (-0.534553) 0.018106 / 0.534201 (-0.516095) 0.433642 / 0.579283 (-0.145641) 0.425311 / 0.434364 (-0.009053) 0.533963 / 0.540337 (-0.006375) 0.638786 / 1.386936 (-0.748151)
PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.007242 / 0.011353 (-0.004111) 0.005599 / 0.011008 (-0.005410) 0.073443 / 0.038508 (0.034935) 0.033764 / 0.023109 (0.010655) 0.365990 / 0.275898 (0.090092) 0.392943 / 0.323480 (0.069463) 0.005987 / 0.007986 (-0.001999) 0.004312 / 0.004328 (-0.000016) 0.072831 / 0.004250 (0.068580) 0.048854 / 0.037052 (0.011802) 0.362477 / 0.258489 (0.103988) 0.399993 / 0.293841 (0.106152) 0.035602 / 0.128546 (-0.092944) 0.012445 / 0.075646 (-0.063202) 0.085768 / 0.419271 (-0.333504) 0.048544 / 0.043533 (0.005011) 0.362246 / 0.255139 (0.107107) 0.388753 / 0.283200 (0.105554) 0.109829 / 0.141683 (-0.031854) 1.546881 / 1.452155 (0.094726) 1.619454 / 1.492716 (0.126737)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.189926 / 0.018006 (0.171920) 0.447936 / 0.000490 (0.447446) 0.002354 / 0.000200 (0.002155) 0.000090 / 0.000054 (0.000035)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.031740 / 0.037411 (-0.005671) 0.122595 / 0.014526 (0.108069) 0.128389 / 0.176557 (-0.048168) 0.180570 / 0.737135 (-0.556566) 0.132939 / 0.296338 (-0.163399)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.425073 / 0.215209 (0.209863) 4.238964 / 2.077655 (2.161309) 2.095116 / 1.504120 (0.590996) 1.913925 / 1.541195 (0.372730) 2.024669 / 1.468490 (0.556179) 0.699172 / 4.584777 (-3.885605) 3.845807 / 3.745712 (0.100094) 2.167502 / 5.269862 (-3.102360) 1.375267 / 4.565676 (-3.190410) 0.086739 / 0.424275 (-0.337536) 0.012198 / 0.007607 (0.004591) 0.525975 / 0.226044 (0.299931) 5.249449 / 2.268929 (2.980521) 2.550565 / 55.444624 (-52.894060) 2.257557 / 6.876477 (-4.618920) 2.298936 / 2.142072 (0.156863) 0.850295 / 4.805227 (-3.954932) 0.170506 / 6.500664 (-6.330158) 0.065659 / 0.075469 (-0.009810)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.330556 / 1.841788 (-0.511231) 16.920203 / 8.074308 (8.845894) 15.966739 / 10.191392 (5.775347) 0.164000 / 0.680424 (-0.516424) 0.018211 / 0.534201 (-0.515990) 0.436253 / 0.579283 (-0.143030) 0.449666 / 0.434364 (0.015302) 0.522287 / 0.540337 (-0.018050) 0.615944 / 1.386936 (-0.770992)

@Rocketknight1 Rocketknight1 force-pushed the reduce_to_tf_dataset_memory_usage branch from 824f96c to b899ea4 Compare May 24, 2023 15:56
@github-actions

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.007273 / 0.011353 (-0.004080) 0.005198 / 0.011008 (-0.005810) 0.114362 / 0.038508 (0.075854) 0.031113 / 0.023109 (0.008003) 0.378568 / 0.275898 (0.102670) 0.441695 / 0.323480 (0.118215) 0.006037 / 0.007986 (-0.001949) 0.005102 / 0.004328 (0.000774) 0.098682 / 0.004250 (0.094432) 0.042797 / 0.037052 (0.005745) 0.360028 / 0.258489 (0.101539) 0.435757 / 0.293841 (0.141916) 0.041438 / 0.128546 (-0.087109) 0.013728 / 0.075646 (-0.061918) 0.376154 / 0.419271 (-0.043117) 0.075324 / 0.043533 (0.031791) 0.357221 / 0.255139 (0.102082) 0.416378 / 0.283200 (0.133178) 0.110707 / 0.141683 (-0.030975) 1.603215 / 1.452155 (0.151061) 1.736843 / 1.492716 (0.244127)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.249479 / 0.018006 (0.231473) 0.513205 / 0.000490 (0.512715) 0.003856 / 0.000200 (0.003656) 0.000100 / 0.000054 (0.000045)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.027750 / 0.037411 (-0.009661) 0.105437 / 0.014526 (0.090911) 0.115903 / 0.176557 (-0.060653) 0.179662 / 0.737135 (-0.557474) 0.116305 / 0.296338 (-0.180033)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.551681 / 0.215209 (0.336472) 5.544590 / 2.077655 (3.466935) 2.193933 / 1.504120 (0.689813) 1.898395 / 1.541195 (0.357201) 1.877288 / 1.468490 (0.408798) 0.858097 / 4.584777 (-3.726680) 4.920982 / 3.745712 (1.175270) 2.478220 / 5.269862 (-2.791641) 1.779608 / 4.565676 (-2.786069) 0.101321 / 0.424275 (-0.322954) 0.012627 / 0.007607 (0.005020) 0.674865 / 0.226044 (0.448820) 6.808224 / 2.268929 (4.539295) 2.822466 / 55.444624 (-52.622159) 2.170379 / 6.876477 (-4.706098) 2.224278 / 2.142072 (0.082205) 1.032763 / 4.805227 (-3.772464) 0.198851 / 6.500664 (-6.301813) 0.069249 / 0.075469 (-0.006220)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.425987 / 1.841788 (-0.415801) 16.212942 / 8.074308 (8.138634) 18.945770 / 10.191392 (8.754378) 0.192901 / 0.680424 (-0.487522) 0.025343 / 0.534201 (-0.508858) 0.465441 / 0.579283 (-0.113842) 0.540966 / 0.434364 (0.106602) 0.576736 / 0.540337 (0.036399) 0.675717 / 1.386936 (-0.711219)
PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.007426 / 0.011353 (-0.003927) 0.005023 / 0.011008 (-0.005985) 0.085083 / 0.038508 (0.046575) 0.030559 / 0.023109 (0.007449) 0.398461 / 0.275898 (0.122563) 0.418998 / 0.323480 (0.095518) 0.006697 / 0.007986 (-0.001288) 0.004665 / 0.004328 (0.000337) 0.087724 / 0.004250 (0.083473) 0.045799 / 0.037052 (0.008747) 0.395165 / 0.258489 (0.136676) 0.430172 / 0.293841 (0.136331) 0.040486 / 0.128546 (-0.088060) 0.014237 / 0.075646 (-0.061409) 0.099429 / 0.419271 (-0.319843) 0.056006 / 0.043533 (0.012473) 0.389046 / 0.255139 (0.133907) 0.419559 / 0.283200 (0.136359) 0.108550 / 0.141683 (-0.033132) 1.614052 / 1.452155 (0.161897) 1.677785 / 1.492716 (0.185069)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.202178 / 0.018006 (0.184172) 0.486365 / 0.000490 (0.485875) 0.003844 / 0.000200 (0.003644) 0.000112 / 0.000054 (0.000058)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.027963 / 0.037411 (-0.009449) 0.110399 / 0.014526 (0.095873) 0.122266 / 0.176557 (-0.054291) 0.178551 / 0.737135 (-0.558585) 0.129259 / 0.296338 (-0.167080)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.604178 / 0.215209 (0.388969) 6.135943 / 2.077655 (4.058288) 2.547576 / 1.504120 (1.043456) 2.262470 / 1.541195 (0.721276) 2.275402 / 1.468490 (0.806912) 0.878804 / 4.584777 (-3.705972) 5.152200 / 3.745712 (1.406488) 2.553715 / 5.269862 (-2.716147) 1.580959 / 4.565676 (-2.984717) 0.107895 / 0.424275 (-0.316380) 0.012751 / 0.007607 (0.005143) 0.770678 / 0.226044 (0.544633) 7.744303 / 2.268929 (5.475374) 3.342037 / 55.444624 (-52.102588) 2.756848 / 6.876477 (-4.119629) 2.739357 / 2.142072 (0.597285) 1.086330 / 4.805227 (-3.718897) 0.230983 / 6.500664 (-6.269681) 0.073771 / 0.075469 (-0.001698)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.493441 / 1.841788 (-0.348347) 16.621611 / 8.074308 (8.547303) 19.081000 / 10.191392 (8.889608) 0.215623 / 0.680424 (-0.464801) 0.025660 / 0.534201 (-0.508541) 0.446490 / 0.579283 (-0.132793) 0.560078 / 0.434364 (0.125714) 0.527231 / 0.540337 (-0.013106) 0.636551 / 1.386936 (-0.750385)

@Rocketknight1 Rocketknight1 force-pushed the reduce_to_tf_dataset_memory_usage branch from b899ea4 to 81761db Compare June 7, 2023 14:43
@github-actions

github-actions bot commented Jun 7, 2023

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.008266 / 0.011353 (-0.003087) 0.005082 / 0.011008 (-0.005927) 0.119858 / 0.038508 (0.081350) 0.032907 / 0.023109 (0.009798) 0.362816 / 0.275898 (0.086918) 0.403684 / 0.323480 (0.080204) 0.006296 / 0.007986 (-0.001690) 0.006220 / 0.004328 (0.001891) 0.095609 / 0.004250 (0.091359) 0.048734 / 0.037052 (0.011682) 0.385724 / 0.258489 (0.127235) 0.424315 / 0.293841 (0.130475) 0.042344 / 0.128546 (-0.086202) 0.016147 / 0.075646 (-0.059500) 0.409661 / 0.419271 (-0.009610) 0.057900 / 0.043533 (0.014367) 0.387013 / 0.255139 (0.131874) 0.388901 / 0.283200 (0.105702) 0.103920 / 0.141683 (-0.037762) 1.732730 / 1.452155 (0.280575) 1.863912 / 1.492716 (0.371196)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.237406 / 0.018006 (0.219400) 0.514398 / 0.000490 (0.513909) 0.005941 / 0.000200 (0.005741) 0.000109 / 0.000054 (0.000054)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.027524 / 0.037411 (-0.009888) 0.116498 / 0.014526 (0.101972) 0.129034 / 0.176557 (-0.047522) 0.218272 / 0.737135 (-0.518864) 0.148389 / 0.296338 (-0.147950)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.604555 / 0.215209 (0.389346) 5.921576 / 2.077655 (3.843921) 2.410483 / 1.504120 (0.906363) 2.220286 / 1.541195 (0.679092) 2.138880 / 1.468490 (0.670390) 0.934962 / 4.584777 (-3.649815) 5.808855 / 3.745712 (2.063143) 4.881554 / 5.269862 (-0.388308) 2.536408 / 4.565676 (-2.029268) 0.124260 / 0.424275 (-0.300015) 0.017798 / 0.007607 (0.010190) 0.778991 / 0.226044 (0.552947) 7.899262 / 2.268929 (5.630333) 3.208667 / 55.444624 (-52.235957) 2.631182 / 6.876477 (-4.245295) 2.676199 / 2.142072 (0.534127) 1.165516 / 4.805227 (-3.639711) 0.228751 / 6.500664 (-6.271913) 0.081378 / 0.075469 (0.005909)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.522156 / 1.841788 (-0.319632) 17.975381 / 8.074308 (9.901073) 18.918882 / 10.191392 (8.727490) 0.223984 / 0.680424 (-0.456440) 0.025171 / 0.534201 (-0.509030) 0.467894 / 0.579283 (-0.111389) 0.559501 / 0.434364 (0.125137) 0.550392 / 0.540337 (0.010055) 0.696923 / 1.386936 (-0.690013)
PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.008577 / 0.011353 (-0.002775) 0.006735 / 0.011008 (-0.004273) 0.095108 / 0.038508 (0.056600) 0.035059 / 0.023109 (0.011950) 0.448576 / 0.275898 (0.172677) 0.492049 / 0.323480 (0.168569) 0.006600 / 0.007986 (-0.001385) 0.004760 / 0.004328 (0.000431) 0.094670 / 0.004250 (0.090419) 0.052543 / 0.037052 (0.015491) 0.458927 / 0.258489 (0.200438) 0.511522 / 0.293841 (0.217681) 0.046046 / 0.128546 (-0.082500) 0.015227 / 0.075646 (-0.060419) 0.114585 / 0.419271 (-0.304686) 0.057569 / 0.043533 (0.014036) 0.441989 / 0.255139 (0.186850) 0.487001 / 0.283200 (0.203801) 0.115688 / 0.141683 (-0.025995) 1.777366 / 1.452155 (0.325211) 1.906216 / 1.492716 (0.413499)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.224880 / 0.018006 (0.206874) 0.504153 / 0.000490 (0.503664) 0.001143 / 0.000200 (0.000943) 0.000111 / 0.000054 (0.000056)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.033618 / 0.037411 (-0.003793) 0.127396 / 0.014526 (0.112870) 0.135648 / 0.176557 (-0.040909) 0.193140 / 0.737135 (-0.543995) 0.142129 / 0.296338 (-0.154209)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.692845 / 0.215209 (0.477636) 6.804897 / 2.077655 (4.727242) 2.851041 / 1.504120 (1.346921) 2.480698 / 1.541195 (0.939504) 2.488619 / 1.468490 (1.020129) 0.970439 / 4.584777 (-3.614338) 5.466059 / 3.745712 (1.720347) 2.790261 / 5.269862 (-2.479601) 1.727638 / 4.565676 (-2.838039) 0.116345 / 0.424275 (-0.307930) 0.014348 / 0.007607 (0.006740) 0.845510 / 0.226044 (0.619465) 8.397198 / 2.268929 (6.128270) 3.591998 / 55.444624 (-51.852626) 2.858339 / 6.876477 (-4.018137) 2.905075 / 2.142072 (0.763003) 1.193569 / 4.805227 (-3.611658) 0.243091 / 6.500664 (-6.257573) 0.082198 / 0.075469 (0.006729)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.610327 / 1.841788 (-0.231461) 17.191414 / 8.074308 (9.117106) 20.176518 / 10.191392 (9.985126) 0.246574 / 0.680424 (-0.433850) 0.024343 / 0.534201 (-0.509858) 0.482091 / 0.579283 (-0.097192) 0.585241 / 0.434364 (0.150877) 0.558833 / 0.540337 (0.018496) 0.654811 / 1.386936 (-0.732125)

@github-actions

github-actions bot commented Jun 7, 2023

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.006353 / 0.011353 (-0.004999) 0.004393 / 0.011008 (-0.006616) 0.098751 / 0.038508 (0.060242) 0.029090 / 0.023109 (0.005981) 0.304169 / 0.275898 (0.028271) 0.339879 / 0.323480 (0.016399) 0.005577 / 0.007986 (-0.002408) 0.003516 / 0.004328 (-0.000813) 0.077347 / 0.004250 (0.073097) 0.041935 / 0.037052 (0.004882) 0.305865 / 0.258489 (0.047376) 0.357063 / 0.293841 (0.063222) 0.025245 / 0.128546 (-0.103301) 0.008753 / 0.075646 (-0.066893) 0.316734 / 0.419271 (-0.102538) 0.043464 / 0.043533 (-0.000069) 0.300944 / 0.255139 (0.045805) 0.330091 / 0.283200 (0.046891) 0.088593 / 0.141683 (-0.053090) 1.588958 / 1.452155 (0.136803) 1.641376 / 1.492716 (0.148660)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.220290 / 0.018006 (0.202284) 0.445430 / 0.000490 (0.444940) 0.004800 / 0.000200 (0.004600) 0.000075 / 0.000054 (0.000020)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.023828 / 0.037411 (-0.013583) 0.103446 / 0.014526 (0.088920) 0.110668 / 0.176557 (-0.065889) 0.169604 / 0.737135 (-0.567531) 0.114818 / 0.296338 (-0.181520)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.416951 / 0.215209 (0.201742) 4.138917 / 2.077655 (2.061263) 1.891265 / 1.504120 (0.387145) 1.687068 / 1.541195 (0.145873) 1.726618 / 1.468490 (0.258128) 0.546977 / 4.584777 (-4.037800) 3.536153 / 3.745712 (-0.209560) 1.795206 / 5.269862 (-3.474656) 1.019845 / 4.565676 (-3.545831) 0.067040 / 0.424275 (-0.357235) 0.012038 / 0.007607 (0.004431) 0.520583 / 0.226044 (0.294539) 5.211520 / 2.268929 (2.942591) 2.336136 / 55.444624 (-53.108488) 2.011262 / 6.876477 (-4.865215) 2.137311 / 2.142072 (-0.004762) 0.654779 / 4.805227 (-4.150448) 0.134555 / 6.500664 (-6.366109) 0.066427 / 0.075469 (-0.009042)

Benchmark: benchmark_map_filter.json

| metric | filter | map fast-tokenizer batched | map identity | map identity batched | map no-op batched | map no-op batched numpy | map no-op batched pandas | map no-op batched pytorch | map no-op batched tensorflow |
|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 1.240187 / 1.841788 (-0.601600) | 14.104063 / 8.074308 (6.029755) | 13.369572 / 10.191392 (3.178180) | 0.147891 / 0.680424 (-0.532533) | 0.016993 / 0.534201 (-0.517208) | 0.364863 / 0.579283 (-0.214420) | 0.398684 / 0.434364 (-0.035680) | 0.430524 / 0.540337 (-0.109813) | 0.520920 / 1.386936 (-0.866016) |

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.006845 / 0.011353 (-0.004508) 0.004420 / 0.011008 (-0.006588) 0.078334 / 0.038508 (0.039825) 0.030566 / 0.023109 (0.007457) 0.409568 / 0.275898 (0.133670) 0.458389 / 0.323480 (0.134910) 0.005739 / 0.007986 (-0.002247) 0.005222 / 0.004328 (0.000893) 0.076066 / 0.004250 (0.071816) 0.049239 / 0.037052 (0.012187) 0.409841 / 0.258489 (0.151352) 0.472250 / 0.293841 (0.178409) 0.025463 / 0.128546 (-0.103084) 0.008738 / 0.075646 (-0.066909) 0.083114 / 0.419271 (-0.336157) 0.041233 / 0.043533 (-0.002300) 0.407158 / 0.255139 (0.152019) 0.438724 / 0.283200 (0.155524) 0.097974 / 0.141683 (-0.043709) 1.536514 / 1.452155 (0.084360) 1.636704 / 1.492716 (0.143987)

Benchmark: benchmark_getitem_100B.json

| metric | get_batch_of_1024_random_rows | get_batch_of_1024_rows | get_first_row | get_last_row |
|---|---|---|---|---|
| new / old (diff) | 0.240589 / 0.018006 (0.222583) | 0.440328 / 0.000490 (0.439838) | 0.000937 / 0.000200 (0.000737) | 0.000076 / 0.000054 (0.000021) |

Benchmark: benchmark_indices_mapping.json

| metric | select | shard | shuffle | sort | train_test_split |
|---|---|---|---|---|---|
| new / old (diff) | 0.027559 / 0.037411 (-0.009853) | 0.109930 / 0.014526 (0.095405) | 0.113366 / 0.176557 (-0.063190) | 0.166849 / 0.737135 (-0.570286) | 0.118872 / 0.296338 (-0.177467) |

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.474120 / 0.215209 (0.258911) 4.739222 / 2.077655 (2.661567) 2.484386 / 1.504120 (0.980266) 2.281937 / 1.541195 (0.740742) 2.362974 / 1.468490 (0.894484) 0.549897 / 4.584777 (-4.034879) 3.425540 / 3.745712 (-0.320172) 1.765810 / 5.269862 (-3.504051) 1.008277 / 4.565676 (-3.557400) 0.067288 / 0.424275 (-0.356987) 0.011954 / 0.007607 (0.004347) 0.577216 / 0.226044 (0.351172) 5.790659 / 2.268929 (3.521731) 2.946732 / 55.444624 (-52.497892) 2.608835 / 6.876477 (-4.267641) 2.642987 / 2.142072 (0.500915) 0.652798 / 4.805227 (-4.152429) 0.135909 / 6.500664 (-6.364755) 0.068480 / 0.075469 (-0.006989)

Benchmark: benchmark_map_filter.json

| metric | filter | map fast-tokenizer batched | map identity | map identity batched | map no-op batched | map no-op batched numpy | map no-op batched pandas | map no-op batched pytorch | map no-op batched tensorflow |
|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 1.353550 / 1.841788 (-0.488237) | 14.732084 / 8.074308 (6.657775) | 14.439174 / 10.191392 (4.247782) | 0.131445 / 0.680424 (-0.548979) | 0.016608 / 0.534201 (-0.517593) | 0.368103 / 0.579283 (-0.211180) | 0.393918 / 0.434364 (-0.040446) | 0.423562 / 0.540337 (-0.116776) | 0.515041 / 1.386936 (-0.871895) |

@github-actions
github-actions bot commented Jun 7, 2023

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.006414 / 0.011353 (-0.004938) 0.004704 / 0.011008 (-0.006305) 0.096012 / 0.038508 (0.057504) 0.032910 / 0.023109 (0.009800) 0.290676 / 0.275898 (0.014778) 0.319646 / 0.323480 (-0.003834) 0.005806 / 0.007986 (-0.002180) 0.004008 / 0.004328 (-0.000320) 0.073982 / 0.004250 (0.069731) 0.048985 / 0.037052 (0.011933) 0.299498 / 0.258489 (0.041009) 0.338118 / 0.293841 (0.044277) 0.027680 / 0.128546 (-0.100866) 0.009051 / 0.075646 (-0.066595) 0.325051 / 0.419271 (-0.094221) 0.051011 / 0.043533 (0.007478) 0.292249 / 0.255139 (0.037110) 0.315733 / 0.283200 (0.032533) 0.100327 / 0.141683 (-0.041356) 1.481862 / 1.452155 (0.029707) 1.544884 / 1.492716 (0.052168)

Benchmark: benchmark_getitem_100B.json

| metric | get_batch_of_1024_random_rows | get_batch_of_1024_rows | get_first_row | get_last_row |
|---|---|---|---|---|
| new / old (diff) | 0.289610 / 0.018006 (0.271603) | 0.510164 / 0.000490 (0.509675) | 0.004726 / 0.000200 (0.004526) | 0.000090 / 0.000054 (0.000036) |

Benchmark: benchmark_indices_mapping.json

| metric | select | shard | shuffle | sort | train_test_split |
|---|---|---|---|---|---|
| new / old (diff) | 0.027617 / 0.037411 (-0.009794) | 0.107593 / 0.014526 (0.093068) | 0.122783 / 0.176557 (-0.053774) | 0.181086 / 0.737135 (-0.556049) | 0.128030 / 0.296338 (-0.168308) |

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.403571 / 0.215209 (0.188362) 4.002881 / 2.077655 (1.925227) 1.805550 / 1.504120 (0.301430) 1.619165 / 1.541195 (0.077971) 1.606536 / 1.468490 (0.138046) 0.518917 / 4.584777 (-4.065860) 3.731498 / 3.745712 (-0.014214) 3.206645 / 5.269862 (-2.063217) 1.641615 / 4.565676 (-2.924062) 0.065100 / 0.424275 (-0.359175) 0.011396 / 0.007607 (0.003789) 0.500597 / 0.226044 (0.274553) 4.992293 / 2.268929 (2.723364) 2.278726 / 55.444624 (-53.165898) 1.960823 / 6.876477 (-4.915654) 2.038684 / 2.142072 (-0.103388) 0.640910 / 4.805227 (-4.164318) 0.140597 / 6.500664 (-6.360067) 0.062114 / 0.075469 (-0.013355)

Benchmark: benchmark_map_filter.json

| metric | filter | map fast-tokenizer batched | map identity | map identity batched | map no-op batched | map no-op batched numpy | map no-op batched pandas | map no-op batched pytorch | map no-op batched tensorflow |
|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 1.167366 / 1.841788 (-0.674422) | 14.748193 / 8.074308 (6.673884) | 13.592381 / 10.191392 (3.400989) | 0.165341 / 0.680424 (-0.515083) | 0.017360 / 0.534201 (-0.516841) | 0.393448 / 0.579283 (-0.185836) | 0.422951 / 0.434364 (-0.011413) | 0.460491 / 0.540337 (-0.079847) | 0.558238 / 1.386936 (-0.828698) |

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.006373 / 0.011353 (-0.004980) 0.004587 / 0.011008 (-0.006421) 0.076421 / 0.038508 (0.037913) 0.032162 / 0.023109 (0.009052) 0.385531 / 0.275898 (0.109633) 0.410424 / 0.323480 (0.086944) 0.006154 / 0.007986 (-0.001832) 0.005533 / 0.004328 (0.001205) 0.077035 / 0.004250 (0.072784) 0.051571 / 0.037052 (0.014519) 0.393283 / 0.258489 (0.134794) 0.433756 / 0.293841 (0.139915) 0.028381 / 0.128546 (-0.100165) 0.009034 / 0.075646 (-0.066613) 0.083836 / 0.419271 (-0.335435) 0.048246 / 0.043533 (0.004713) 0.385437 / 0.255139 (0.130298) 0.394187 / 0.283200 (0.110987) 0.105453 / 0.141683 (-0.036230) 1.459173 / 1.452155 (0.007018) 1.575083 / 1.492716 (0.082367)

Benchmark: benchmark_getitem_100B.json

| metric | get_batch_of_1024_random_rows | get_batch_of_1024_rows | get_first_row | get_last_row |
|---|---|---|---|---|
| new / old (diff) | 0.320324 / 0.018006 (0.302318) | 0.502945 / 0.000490 (0.502455) | 0.004470 / 0.000200 (0.004270) | 0.000107 / 0.000054 (0.000053) |

Benchmark: benchmark_indices_mapping.json

| metric | select | shard | shuffle | sort | train_test_split |
|---|---|---|---|---|---|
| new / old (diff) | 0.028118 / 0.037411 (-0.009293) | 0.111430 / 0.014526 (0.096904) | 0.123141 / 0.176557 (-0.053415) | 0.175215 / 0.737135 (-0.561920) | 0.126429 / 0.296338 (-0.169909) |

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.433407 / 0.215209 (0.218198) 4.329945 / 2.077655 (2.252291) 2.096822 / 1.504120 (0.592702) 1.908173 / 1.541195 (0.366978) 1.967167 / 1.468490 (0.498676) 0.529207 / 4.584777 (-4.055570) 3.798424 / 3.745712 (0.052712) 3.050716 / 5.269862 (-2.219146) 1.445009 / 4.565676 (-3.120668) 0.066467 / 0.424275 (-0.357809) 0.011698 / 0.007607 (0.004090) 0.528660 / 0.226044 (0.302615) 5.282069 / 2.268929 (3.013141) 2.535501 / 55.444624 (-52.909124) 2.202856 / 6.876477 (-4.673621) 2.293225 / 2.142072 (0.151153) 0.640216 / 4.805227 (-4.165011) 0.140884 / 6.500664 (-6.359780) 0.064231 / 0.075469 (-0.011238)

Benchmark: benchmark_map_filter.json

| metric | filter | map fast-tokenizer batched | map identity | map identity batched | map no-op batched | map no-op batched numpy | map no-op batched pandas | map no-op batched pytorch | map no-op batched tensorflow |
|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 1.292129 / 1.841788 (-0.549659) | 15.371370 / 8.074308 (7.297062) | 15.114854 / 10.191392 (4.923462) | 0.176870 / 0.680424 (-0.503554) | 0.017380 / 0.534201 (-0.516821) | 0.398156 / 0.579283 (-0.181127) | 0.442277 / 0.434364 (0.007913) | 0.467093 / 0.540337 (-0.073244) | 0.561599 / 1.386936 (-0.825337) |

@github-actions
github-actions bot commented Jun 7, 2023

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.009360 / 0.011353 (-0.001993) 0.006297 / 0.011008 (-0.004712) 0.133131 / 0.038508 (0.094623) 0.040261 / 0.023109 (0.017152) 0.419101 / 0.275898 (0.143203) 0.453087 / 0.323480 (0.129607) 0.007718 / 0.007986 (-0.000268) 0.005698 / 0.004328 (0.001369) 0.102261 / 0.004250 (0.098010) 0.055147 / 0.037052 (0.018095) 0.428355 / 0.258489 (0.169866) 0.505241 / 0.293841 (0.211400) 0.046745 / 0.128546 (-0.081802) 0.015559 / 0.075646 (-0.060088) 0.441775 / 0.419271 (0.022503) 0.070165 / 0.043533 (0.026632) 0.421957 / 0.255139 (0.166818) 0.445156 / 0.283200 (0.161957) 0.126321 / 0.141683 (-0.015362) 1.900486 / 1.452155 (0.448331) 2.088630 / 1.492716 (0.595913)

Benchmark: benchmark_getitem_100B.json

| metric | get_batch_of_1024_random_rows | get_batch_of_1024_rows | get_first_row | get_last_row |
|---|---|---|---|---|
| new / old (diff) | 0.260244 / 0.018006 (0.242237) | 0.606317 / 0.000490 (0.605828) | 0.006827 / 0.000200 (0.006627) | 0.000117 / 0.000054 (0.000063) |

Benchmark: benchmark_indices_mapping.json

| metric | select | shard | shuffle | sort | train_test_split |
|---|---|---|---|---|---|
| new / old (diff) | 0.031958 / 0.037411 (-0.005453) | 0.139362 / 0.014526 (0.124836) | 0.148748 / 0.176557 (-0.027809) | 0.226269 / 0.737135 (-0.510866) | 0.161145 / 0.296338 (-0.135194) |

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.666287 / 0.215209 (0.451078) 6.588707 / 2.077655 (4.511053) 2.736155 / 1.504120 (1.232035) 2.329601 / 1.541195 (0.788406) 2.324991 / 1.468490 (0.856501) 0.943608 / 4.584777 (-3.641169) 6.051653 / 3.745712 (2.305941) 2.929150 / 5.269862 (-2.340711) 1.804461 / 4.565676 (-2.761216) 0.113302 / 0.424275 (-0.310973) 0.015245 / 0.007607 (0.007638) 0.827029 / 0.226044 (0.600984) 8.211536 / 2.268929 (5.942608) 3.445231 / 55.444624 (-51.999393) 2.756728 / 6.876477 (-4.119748) 2.904039 / 2.142072 (0.761966) 1.162339 / 4.805227 (-3.642888) 0.231168 / 6.500664 (-6.269496) 0.089038 / 0.075469 (0.013569)

Benchmark: benchmark_map_filter.json

| metric | filter | map fast-tokenizer batched | map identity | map identity batched | map no-op batched | map no-op batched numpy | map no-op batched pandas | map no-op batched pytorch | map no-op batched tensorflow |
|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 1.640619 / 1.841788 (-0.201169) | 20.034157 / 8.074308 (11.959849) | 22.346006 / 10.191392 (12.154614) | 0.255300 / 0.680424 (-0.425124) | 0.031452 / 0.534201 (-0.502749) | 0.563290 / 0.579283 (-0.015993) | 0.653556 / 0.434364 (0.219192) | 0.687663 / 0.540337 (0.147326) | 0.816432 / 1.386936 (-0.570504) |

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.010340 / 0.011353 (-0.001013) 0.006245 / 0.011008 (-0.004764) 0.128012 / 0.038508 (0.089504) 0.041799 / 0.023109 (0.018690) 0.533340 / 0.275898 (0.257442) 0.592243 / 0.323480 (0.268763) 0.009256 / 0.007986 (0.001271) 0.005310 / 0.004328 (0.000982) 0.110973 / 0.004250 (0.106722) 0.065465 / 0.037052 (0.028412) 0.533845 / 0.258489 (0.275356) 0.602190 / 0.293841 (0.308349) 0.060245 / 0.128546 (-0.068301) 0.016954 / 0.075646 (-0.058693) 0.119727 / 0.419271 (-0.299545) 0.064628 / 0.043533 (0.021095) 0.558229 / 0.255139 (0.303090) 0.563696 / 0.283200 (0.280496) 0.137225 / 0.141683 (-0.004458) 2.038605 / 1.452155 (0.586451) 2.158655 / 1.492716 (0.665939)

Benchmark: benchmark_getitem_100B.json

| metric | get_batch_of_1024_random_rows | get_batch_of_1024_rows | get_first_row | get_last_row |
|---|---|---|---|---|
| new / old (diff) | 0.327067 / 0.018006 (0.309061) | 0.628812 / 0.000490 (0.628323) | 0.010259 / 0.000200 (0.010059) | 0.000123 / 0.000054 (0.000069) |

Benchmark: benchmark_indices_mapping.json

| metric | select | shard | shuffle | sort | train_test_split |
|---|---|---|---|---|---|
| new / old (diff) | 0.037023 / 0.037411 (-0.000388) | 0.142462 / 0.014526 (0.127936) | 0.158165 / 0.176557 (-0.018392) | 0.220808 / 0.737135 (-0.516328) | 0.163608 / 0.296338 (-0.132731) |

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.776119 / 0.215209 (0.560910) 7.813044 / 2.077655 (5.735389) 3.610901 / 1.504120 (2.106781) 3.195144 / 1.541195 (1.653950) 3.218245 / 1.468490 (1.749755) 1.092732 / 4.584777 (-3.492045) 5.965526 / 3.745712 (2.219813) 2.914683 / 5.269862 (-2.355179) 1.848397 / 4.565676 (-2.717280) 0.114436 / 0.424275 (-0.309839) 0.014794 / 0.007607 (0.007187) 0.887141 / 0.226044 (0.661096) 9.009743 / 2.268929 (6.740815) 4.180143 / 55.444624 (-51.264481) 3.452194 / 6.876477 (-3.424283) 3.493520 / 2.142072 (1.351448) 1.233327 / 4.805227 (-3.571900) 0.235390 / 6.500664 (-6.265274) 0.099544 / 0.075469 (0.024075)

Benchmark: benchmark_map_filter.json

| metric | filter | map fast-tokenizer batched | map identity | map identity batched | map no-op batched | map no-op batched numpy | map no-op batched pandas | map no-op batched pytorch | map no-op batched tensorflow |
|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 1.853482 / 1.841788 (0.011694) | 20.071177 / 8.074308 (11.996869) | 24.507618 / 10.191392 (14.316226) | 0.260164 / 0.680424 (-0.420260) | 0.028433 / 0.534201 (-0.505768) | 0.549181 / 0.579283 (-0.030102) | 0.650069 / 0.434364 (0.215705) | 0.629541 / 0.540337 (0.089203) | 0.808932 / 1.386936 (-0.578004) |

@github-actions
github-actions bot commented Jun 7, 2023

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.009537 / 0.011353 (-0.001816) 0.006036 / 0.011008 (-0.004972) 0.141210 / 0.038508 (0.102701) 0.037493 / 0.023109 (0.014384) 0.404285 / 0.275898 (0.128386) 0.458906 / 0.323480 (0.135427) 0.007224 / 0.007986 (-0.000761) 0.005148 / 0.004328 (0.000819) 0.103889 / 0.004250 (0.099639) 0.048877 / 0.037052 (0.011824) 0.413220 / 0.258489 (0.154731) 0.458153 / 0.293841 (0.164312) 0.046008 / 0.128546 (-0.082538) 0.015116 / 0.075646 (-0.060531) 0.439836 / 0.419271 (0.020565) 0.067527 / 0.043533 (0.023994) 0.435794 / 0.255139 (0.180656) 0.451687 / 0.283200 (0.168487) 0.121274 / 0.141683 (-0.020409) 1.950199 / 1.452155 (0.498044) 2.035589 / 1.492716 (0.542873)

Benchmark: benchmark_getitem_100B.json

| metric | get_batch_of_1024_random_rows | get_batch_of_1024_rows | get_first_row | get_last_row |
|---|---|---|---|---|
| new / old (diff) | 0.247056 / 0.018006 (0.229050) | 0.550348 / 0.000490 (0.549858) | 0.005504 / 0.000200 (0.005305) | 0.000116 / 0.000054 (0.000061) |

Benchmark: benchmark_indices_mapping.json

| metric | select | shard | shuffle | sort | train_test_split |
|---|---|---|---|---|---|
| new / old (diff) | 0.032171 / 0.037411 (-0.005240) | 0.135983 / 0.014526 (0.121457) | 0.149587 / 0.176557 (-0.026970) | 0.233414 / 0.737135 (-0.503722) | 0.152598 / 0.296338 (-0.143740) |

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.634813 / 0.215209 (0.419604) 6.453619 / 2.077655 (4.375964) 2.582070 / 1.504120 (1.077951) 2.214292 / 1.541195 (0.673097) 2.220012 / 1.468490 (0.751522) 0.987374 / 4.584777 (-3.597403) 5.543760 / 3.745712 (1.798047) 2.808865 / 5.269862 (-2.460996) 1.714713 / 4.565676 (-2.850963) 0.111016 / 0.424275 (-0.313259) 0.014688 / 0.007607 (0.007081) 0.842542 / 0.226044 (0.616498) 8.414336 / 2.268929 (6.145407) 3.501021 / 55.444624 (-51.943604) 2.665335 / 6.876477 (-4.211142) 2.843706 / 2.142072 (0.701633) 1.196398 / 4.805227 (-3.608829) 0.245508 / 6.500664 (-6.255156) 0.086970 / 0.075469 (0.011501)

Benchmark: benchmark_map_filter.json

| metric | filter | map fast-tokenizer batched | map identity | map identity batched | map no-op batched | map no-op batched numpy | map no-op batched pandas | map no-op batched pytorch | map no-op batched tensorflow |
|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 1.590244 / 1.841788 (-0.251544) | 18.694141 / 8.074308 (10.619833) | 21.752463 / 10.191392 (11.561071) | 0.264511 / 0.680424 (-0.415913) | 0.028713 / 0.534201 (-0.505488) | 0.531102 / 0.579283 (-0.048181) | 0.626302 / 0.434364 (0.191938) | 0.624541 / 0.540337 (0.084203) | 0.745745 / 1.386936 (-0.641191) |

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.010097 / 0.011353 (-0.001256) 0.005558 / 0.011008 (-0.005451) 0.111326 / 0.038508 (0.072818) 0.036465 / 0.023109 (0.013356) 0.472116 / 0.275898 (0.196218) 0.524479 / 0.323480 (0.200999) 0.007466 / 0.007986 (-0.000520) 0.005440 / 0.004328 (0.001112) 0.103482 / 0.004250 (0.099231) 0.053217 / 0.037052 (0.016165) 0.476685 / 0.258489 (0.218196) 0.554011 / 0.293841 (0.260170) 0.047157 / 0.128546 (-0.081390) 0.015895 / 0.075646 (-0.059751) 0.115997 / 0.419271 (-0.303274) 0.062290 / 0.043533 (0.018758) 0.474166 / 0.255139 (0.219027) 0.498854 / 0.283200 (0.215655) 0.121798 / 0.141683 (-0.019885) 1.956583 / 1.452155 (0.504428) 2.069620 / 1.492716 (0.576904)

Benchmark: benchmark_getitem_100B.json

| metric | get_batch_of_1024_random_rows | get_batch_of_1024_rows | get_first_row | get_last_row |
|---|---|---|---|---|
| new / old (diff) | 0.278637 / 0.018006 (0.260631) | 0.555295 / 0.000490 (0.554805) | 0.007401 / 0.000200 (0.007201) | 0.000121 / 0.000054 (0.000066) |

Benchmark: benchmark_indices_mapping.json

| metric | select | shard | shuffle | sort | train_test_split |
|---|---|---|---|---|---|
| new / old (diff) | 0.033576 / 0.037411 (-0.003835) | 0.136479 / 0.014526 (0.121954) | 0.153960 / 0.176557 (-0.022597) | 0.203422 / 0.737135 (-0.533713) | 0.154159 / 0.296338 (-0.142180) |

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.672561 / 0.215209 (0.457352) 6.956675 / 2.077655 (4.879020) 3.063636 / 1.504120 (1.559516) 2.668256 / 1.541195 (1.127061) 2.794793 / 1.468490 (1.326303) 0.964242 / 4.584777 (-3.620535) 5.785992 / 3.745712 (2.040279) 2.850079 / 5.269862 (-2.419782) 1.782491 / 4.565676 (-2.783186) 0.114859 / 0.424275 (-0.309416) 0.015229 / 0.007607 (0.007622) 0.858406 / 0.226044 (0.632362) 8.646296 / 2.268929 (6.377367) 3.842133 / 55.444624 (-51.602492) 3.180017 / 6.876477 (-3.696460) 3.241315 / 2.142072 (1.099243) 1.248988 / 4.805227 (-3.556239) 0.235075 / 6.500664 (-6.265589) 0.087192 / 0.075469 (0.011723)

Benchmark: benchmark_map_filter.json

| metric | filter | map fast-tokenizer batched | map identity | map identity batched | map no-op batched | map no-op batched numpy | map no-op batched pandas | map no-op batched pytorch | map no-op batched tensorflow |
|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 1.783877 / 1.841788 (-0.057910) | 19.477223 / 8.074308 (11.402914) | 22.926734 / 10.191392 (12.735342) | 0.246970 / 0.680424 (-0.433454) | 0.026386 / 0.534201 (-0.507815) | 0.517599 / 0.579283 (-0.061684) | 0.626504 / 0.434364 (0.192140) | 0.606943 / 0.540337 (0.066606) | 0.739115 / 1.386936 (-0.647821) |

@github-actions
github-actions bot commented Jun 7, 2023

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.008085 / 0.011353 (-0.003268) 0.005568 / 0.011008 (-0.005440) 0.119674 / 0.038508 (0.081166) 0.040452 / 0.023109 (0.017343) 0.360288 / 0.275898 (0.084390) 0.409448 / 0.323480 (0.085968) 0.007281 / 0.007986 (-0.000705) 0.004931 / 0.004328 (0.000602) 0.089956 / 0.004250 (0.085706) 0.056088 / 0.037052 (0.019036) 0.384708 / 0.258489 (0.126219) 0.423506 / 0.293841 (0.129665) 0.033280 / 0.128546 (-0.095266) 0.010696 / 0.075646 (-0.064951) 0.394851 / 0.419271 (-0.024421) 0.058412 / 0.043533 (0.014879) 0.361514 / 0.255139 (0.106375) 0.399121 / 0.283200 (0.115921) 0.117927 / 0.141683 (-0.023756) 1.791499 / 1.452155 (0.339344) 1.889000 / 1.492716 (0.396284)

Benchmark: benchmark_getitem_100B.json

| metric | get_batch_of_1024_random_rows | get_batch_of_1024_rows | get_first_row | get_last_row |
|---|---|---|---|---|
| new / old (diff) | 0.253324 / 0.018006 (0.235318) | 0.536151 / 0.000490 (0.535661) | 0.010450 / 0.000200 (0.010250) | 0.000171 / 0.000054 (0.000117) |

Benchmark: benchmark_indices_mapping.json

| metric | select | shard | shuffle | sort | train_test_split |
|---|---|---|---|---|---|
| new / old (diff) | 0.034646 / 0.037411 (-0.002765) | 0.145999 / 0.014526 (0.131473) | 0.153793 / 0.176557 (-0.022763) | 0.232871 / 0.737135 (-0.504265) | 0.161151 / 0.296338 (-0.135188) |

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.471407 / 0.215209 (0.256197) 4.715702 / 2.077655 (2.638047) 2.228939 / 1.504120 (0.724819) 2.008511 / 1.541195 (0.467317) 2.135182 / 1.468490 (0.666692) 0.620720 / 4.584777 (-3.964057) 4.960731 / 3.745712 (1.215019) 2.222469 / 5.269862 (-3.047393) 1.284467 / 4.565676 (-3.281209) 0.077931 / 0.424275 (-0.346344) 0.013935 / 0.007607 (0.006328) 0.593164 / 0.226044 (0.367120) 5.940829 / 2.268929 (3.671900) 2.664277 / 55.444624 (-52.780347) 2.290655 / 6.876477 (-4.585822) 2.496664 / 2.142072 (0.354592) 0.759166 / 4.805227 (-4.046061) 0.168011 / 6.500664 (-6.332653) 0.077993 / 0.075469 (0.002524)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.440663 / 1.841788 (-0.401125) 19.105377 / 8.074308 (11.031069) 16.068118 / 10.191392 (5.876726) 0.193024 / 0.680424 (-0.487400) 0.022348 / 0.534201 (-0.511853) 0.517454 / 0.579283 (-0.061829) 0.528072 / 0.434364 (0.093708) 0.565293 / 0.540337 (0.024955) 0.676578 / 1.386936 (-0.710358)
PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.008089 / 0.011353 (-0.003264) 0.005287 / 0.011008 (-0.005721) 0.087964 / 0.038508 (0.049456) 0.041548 / 0.023109 (0.018439) 0.437733 / 0.275898 (0.161835) 0.487878 / 0.323480 (0.164398) 0.006898 / 0.007986 (-0.001087) 0.004649 / 0.004328 (0.000320) 0.086982 / 0.004250 (0.082732) 0.056874 / 0.037052 (0.019822) 0.437397 / 0.258489 (0.178908) 0.490636 / 0.293841 (0.196795) 0.033550 / 0.128546 (-0.094997) 0.010430 / 0.075646 (-0.065216) 0.096076 / 0.419271 (-0.323196) 0.054028 / 0.043533 (0.010495) 0.450262 / 0.255139 (0.195123) 0.465566 / 0.283200 (0.182366) 0.119987 / 0.141683 (-0.021696) 1.764428 / 1.452155 (0.312273) 1.841547 / 1.492716 (0.348831)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.271427 / 0.018006 (0.253420) 0.506386 / 0.000490 (0.505896) 0.001213 / 0.000200 (0.001013) 0.000125 / 0.000054 (0.000070)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.036159 / 0.037411 (-0.001253) 0.140578 / 0.014526 (0.126053) 0.147517 / 0.176557 (-0.029040) 0.206215 / 0.737135 (-0.530921) 0.152560 / 0.296338 (-0.143779)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.522833 / 0.215209 (0.307624) 5.215732 / 2.077655 (3.138077) 2.553406 / 1.504120 (1.049286) 2.344815 / 1.541195 (0.803620) 2.422377 / 1.468490 (0.953886) 0.631197 / 4.584777 (-3.953580) 4.906216 / 3.745712 (1.160504) 2.212923 / 5.269862 (-3.056938) 1.352937 / 4.565676 (-3.212740) 0.079141 / 0.424275 (-0.345135) 0.013691 / 0.007607 (0.006084) 0.634939 / 0.226044 (0.408895) 6.578770 / 2.268929 (4.309842) 3.080339 / 55.444624 (-52.364286) 2.710243 / 6.876477 (-4.166234) 2.740476 / 2.142072 (0.598404) 0.783610 / 4.805227 (-4.021617) 0.171589 / 6.500664 (-6.329075) 0.077311 / 0.075469 (0.001842)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.584847 / 1.841788 (-0.256941) 19.510132 / 8.074308 (11.435824) 18.074572 / 10.191392 (7.883180) 0.173494 / 0.680424 (-0.506930) 0.021149 / 0.534201 (-0.513052) 0.469026 / 0.579283 (-0.110258) 0.518463 / 0.434364 (0.084099) 0.550363 / 0.540337 (0.010026) 0.667087 / 1.386936 (-0.719849)

@github-actions
github-actions bot commented Jun 7, 2023

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.007144 / 0.011353 (-0.004209) 0.004783 / 0.011008 (-0.006225) 0.103991 / 0.038508 (0.065483) 0.039098 / 0.023109 (0.015989) 0.319851 / 0.275898 (0.043952) 0.356104 / 0.323480 (0.032625) 0.007077 / 0.007986 (-0.000909) 0.004188 / 0.004328 (-0.000141) 0.078360 / 0.004250 (0.074109) 0.050951 / 0.037052 (0.013899) 0.321791 / 0.258489 (0.063302) 0.356123 / 0.293841 (0.062283) 0.028967 / 0.128546 (-0.099579) 0.009091 / 0.075646 (-0.066555) 0.355265 / 0.419271 (-0.064007) 0.052521 / 0.043533 (0.008988) 0.317333 / 0.255139 (0.062194) 0.340747 / 0.283200 (0.057547) 0.104354 / 0.141683 (-0.037329) 1.522791 / 1.452155 (0.070636) 1.579835 / 1.492716 (0.087118)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.260539 / 0.018006 (0.242532) 0.454230 / 0.000490 (0.453740) 0.036588 / 0.000200 (0.036388) 0.000289 / 0.000054 (0.000235)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.028375 / 0.037411 (-0.009036) 0.118939 / 0.014526 (0.104413) 0.126553 / 0.176557 (-0.050004) 0.184596 / 0.737135 (-0.552539) 0.130583 / 0.296338 (-0.165755)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.417353 / 0.215209 (0.202144) 4.171595 / 2.077655 (2.093940) 1.855096 / 1.504120 (0.350976) 1.673941 / 1.541195 (0.132747) 1.761370 / 1.468490 (0.292880) 0.544081 / 4.584777 (-4.040696) 3.851877 / 3.745712 (0.106165) 1.896661 / 5.269862 (-3.373200) 1.093303 / 4.565676 (-3.472373) 0.067967 / 0.424275 (-0.356308) 0.012313 / 0.007607 (0.004706) 0.532316 / 0.226044 (0.306272) 5.336016 / 2.268929 (3.067087) 2.344780 / 55.444624 (-53.099845) 1.993909 / 6.876477 (-4.882568) 2.167324 / 2.142072 (0.025251) 0.670334 / 4.805227 (-4.134893) 0.147705 / 6.500664 (-6.352959) 0.067634 / 0.075469 (-0.007835)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.251005 / 1.841788 (-0.590783) 15.405531 / 8.074308 (7.331223) 14.197019 / 10.191392 (4.005627) 0.144230 / 0.680424 (-0.536193) 0.018352 / 0.534201 (-0.515849) 0.427536 / 0.579283 (-0.151748) 0.433135 / 0.434364 (-0.001229) 0.502624 / 0.540337 (-0.037713) 0.612312 / 1.386936 (-0.774624)
PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.007011 / 0.011353 (-0.004342) 0.004857 / 0.011008 (-0.006151) 0.077797 / 0.038508 (0.039289) 0.035411 / 0.023109 (0.012302) 0.368234 / 0.275898 (0.092336) 0.408359 / 0.323480 (0.084879) 0.005883 / 0.007986 (-0.002102) 0.004311 / 0.004328 (-0.000017) 0.077216 / 0.004250 (0.072966) 0.052062 / 0.037052 (0.015010) 0.368502 / 0.258489 (0.110013) 0.428681 / 0.293841 (0.134840) 0.028889 / 0.128546 (-0.099657) 0.009146 / 0.075646 (-0.066501) 0.085515 / 0.419271 (-0.333756) 0.050216 / 0.043533 (0.006683) 0.359562 / 0.255139 (0.104423) 0.378335 / 0.283200 (0.095135) 0.106351 / 0.141683 (-0.035332) 1.538943 / 1.452155 (0.086788) 1.663572 / 1.492716 (0.170855)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.216917 / 0.018006 (0.198911) 0.444130 / 0.000490 (0.443641) 0.002640 / 0.000200 (0.002440) 0.000093 / 0.000054 (0.000038)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.032509 / 0.037411 (-0.004902) 0.123955 / 0.014526 (0.109430) 0.133236 / 0.176557 (-0.043321) 0.187408 / 0.737135 (-0.549727) 0.136696 / 0.296338 (-0.159643)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.443714 / 0.215209 (0.228505) 4.416973 / 2.077655 (2.339318) 2.145279 / 1.504120 (0.641159) 1.946669 / 1.541195 (0.405474) 2.044105 / 1.468490 (0.575614) 0.534463 / 4.584777 (-4.050314) 3.824926 / 3.745712 (0.079214) 3.151796 / 5.269862 (-2.118066) 1.497513 / 4.565676 (-3.068164) 0.066799 / 0.424275 (-0.357476) 0.012408 / 0.007607 (0.004801) 0.544182 / 0.226044 (0.318138) 5.419403 / 2.268929 (3.150474) 2.605191 / 55.444624 (-52.839433) 2.285354 / 6.876477 (-4.591123) 2.359520 / 2.142072 (0.217448) 0.655489 / 4.805227 (-4.149738) 0.143496 / 6.500664 (-6.357168) 0.066782 / 0.075469 (-0.008687)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.329370 / 1.841788 (-0.512418) 16.058019 / 8.074308 (7.983711) 15.119769 / 10.191392 (4.928377) 0.147967 / 0.680424 (-0.532457) 0.018360 / 0.534201 (-0.515841) 0.436847 / 0.579283 (-0.142436) 0.435136 / 0.434364 (0.000773) 0.507176 / 0.540337 (-0.033161) 0.610627 / 1.386936 (-0.776309)

@github-actions
github-actions bot commented Jun 7, 2023

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.006425 / 0.011353 (-0.004927) 0.003710 / 0.011008 (-0.007298) 0.102072 / 0.038508 (0.063564) 0.033974 / 0.023109 (0.010865) 0.273146 / 0.275898 (-0.002752) 0.313254 / 0.323480 (-0.010226) 0.004889 / 0.007986 (-0.003096) 0.004803 / 0.004328 (0.000475) 0.067359 / 0.004250 (0.063109) 0.040281 / 0.037052 (0.003228) 0.302106 / 0.258489 (0.043617) 0.318039 / 0.293841 (0.024198) 0.028839 / 0.128546 (-0.099707) 0.008726 / 0.075646 (-0.066921) 0.322532 / 0.419271 (-0.096739) 0.048845 / 0.043533 (0.005312) 0.299836 / 0.255139 (0.044697) 0.300983 / 0.283200 (0.017784) 0.103384 / 0.141683 (-0.038299) 1.417245 / 1.452155 (-0.034910) 1.538819 / 1.492716 (0.046102)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.219798 / 0.018006 (0.201792) 0.442297 / 0.000490 (0.441807) 0.013792 / 0.000200 (0.013592) 0.000101 / 0.000054 (0.000046)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.024996 / 0.037411 (-0.012416) 0.098558 / 0.014526 (0.084032) 0.116423 / 0.176557 (-0.060133) 0.163481 / 0.737135 (-0.573654) 0.115031 / 0.296338 (-0.181308)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.392411 / 0.215209 (0.177202) 4.025992 / 2.077655 (1.948337) 1.850809 / 1.504120 (0.346690) 1.668330 / 1.541195 (0.127136) 1.627041 / 1.468490 (0.158551) 0.510721 / 4.584777 (-4.074055) 3.841318 / 3.745712 (0.095606) 3.416979 / 5.269862 (-1.852883) 1.640796 / 4.565676 (-2.924880) 0.061968 / 0.424275 (-0.362307) 0.010281 / 0.007607 (0.002674) 0.485592 / 0.226044 (0.259548) 4.872205 / 2.268929 (2.603277) 2.146753 / 55.444624 (-53.297871) 1.832087 / 6.876477 (-5.044390) 1.920928 / 2.142072 (-0.221144) 0.606363 / 4.805227 (-4.198864) 0.134351 / 6.500664 (-6.366313) 0.057583 / 0.075469 (-0.017886)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.153048 / 1.841788 (-0.688739) 14.165743 / 8.074308 (6.091435) 12.237798 / 10.191392 (2.046406) 0.159815 / 0.680424 (-0.520608) 0.018226 / 0.534201 (-0.515975) 0.372390 / 0.579283 (-0.206893) 0.396552 / 0.434364 (-0.037811) 0.439445 / 0.540337 (-0.100892) 0.521924 / 1.386936 (-0.865012)
PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.006162 / 0.011353 (-0.005191) 0.004006 / 0.011008 (-0.007002) 0.067226 / 0.038508 (0.028718) 0.030285 / 0.023109 (0.007176) 0.361220 / 0.275898 (0.085322) 0.386783 / 0.323480 (0.063303) 0.005202 / 0.007986 (-0.002784) 0.003453 / 0.004328 (-0.000876) 0.068299 / 0.004250 (0.064048) 0.041433 / 0.037052 (0.004381) 0.360222 / 0.258489 (0.101733) 0.399327 / 0.293841 (0.105486) 0.026066 / 0.128546 (-0.102480) 0.008025 / 0.075646 (-0.067621) 0.079588 / 0.419271 (-0.339683) 0.042616 / 0.043533 (-0.000917) 0.347639 / 0.255139 (0.092500) 0.386092 / 0.283200 (0.102893) 0.100869 / 0.141683 (-0.040814) 1.386901 / 1.452155 (-0.065254) 1.471523 / 1.492716 (-0.021193)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.217020 / 0.018006 (0.199014) 0.431033 / 0.000490 (0.430543) 0.002902 / 0.000200 (0.002702) 0.000092 / 0.000054 (0.000037)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.027396 / 0.037411 (-0.010015) 0.114154 / 0.014526 (0.099629) 0.117918 / 0.176557 (-0.058638) 0.173342 / 0.737135 (-0.563794) 0.125812 / 0.296338 (-0.170526)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.424843 / 0.215209 (0.209634) 4.324828 / 2.077655 (2.247174) 2.188263 / 1.504120 (0.684143) 1.912288 / 1.541195 (0.371094) 2.011621 / 1.468490 (0.543131) 0.560944 / 4.584777 (-4.023833) 3.975047 / 3.745712 (0.229335) 3.130242 / 5.269862 (-2.139619) 1.667902 / 4.565676 (-2.897775) 0.062245 / 0.424275 (-0.362030) 0.011300 / 0.007607 (0.003692) 0.498571 / 0.226044 (0.272527) 5.024887 / 2.268929 (2.755958) 2.482967 / 55.444624 (-52.961657) 2.216125 / 6.876477 (-4.660352) 2.175856 / 2.142072 (0.033783) 0.615207 / 4.805227 (-4.190021) 0.133808 / 6.500664 (-6.366856) 0.058681 / 0.075469 (-0.016788)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.370150 / 1.841788 (-0.471637) 14.580907 / 8.074308 (6.506599) 14.209955 / 10.191392 (4.018563) 0.139738 / 0.680424 (-0.540686) 0.018722 / 0.534201 (-0.515479) 0.375755 / 0.579283 (-0.203528) 0.428335 / 0.434364 (-0.006029) 0.438957 / 0.540337 (-0.101380) 0.541130 / 1.386936 (-0.845806)

@Rocketknight1
Member Author

@alvarobartt @lhoestq This should be ready for re-review. I've rebased it onto the recent PR that allows batch_size=None, and it should now support unbatched loading as well.
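
As a rough illustration of the two loading modes mentioned here, the sketch below lazily yields either single row indices (`batch_size=None`) or lists of indices, using a bounded shuffle buffer so that no full-length index array is ever materialized. This is not the code merged in this PR: `iter_index_batches`, `shuffle_buffer`, and the other parameter names are hypothetical, and a buffer-based shuffle only approximates a full permutation.

```python
import random
from typing import Iterator, List, Optional, Union


def iter_index_batches(
    num_rows: int,
    batch_size: Optional[int] = None,
    shuffle_buffer: int = 0,
    drop_remainder: bool = False,
    seed: Optional[int] = None,
) -> Iterator[Union[int, List[int]]]:
    """Lazily yield row indices (batch_size=None) or lists of row indices."""
    if batch_size is not None and batch_size < 1:
        raise ValueError("batch_size must be None or a positive integer")
    rng = random.Random(seed)

    def indices() -> Iterator[int]:
        # No shuffling requested: stream indices in order.
        if shuffle_buffer <= 1:
            yield from range(num_rows)
            return
        buffer: List[int] = []
        for i in range(num_rows):
            buffer.append(i)
            if len(buffer) >= shuffle_buffer:
                # Emit one random element; memory stays bounded by the buffer.
                j = rng.randrange(len(buffer))
                buffer[j], buffer[-1] = buffer[-1], buffer[j]
                yield buffer.pop()
        # Flush whatever is left in the buffer in random order.
        rng.shuffle(buffer)
        yield from buffer

    if batch_size is None:
        # Unbatched mode: one index at a time.
        yield from indices()
        return

    batch: List[int] = []
    for i in indices():
        batch.append(i)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch and not drop_remainder:
        yield batch
```

With `shuffle_buffer` left at 0 the indices come out in order; a larger buffer trades memory for a more thorough shuffle.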

Having a variety of different methods like this is annoying, but once our minimum Python version is 3.8 I can go back and clear a lot of this out!

@@ -173,6 +174,21 @@ def dataset_to_tf(
    else:
        raise ImportError("Called a Tensorflow-specific function but Tensorflow is not installed.")

    # TODO Matt: When our minimum Python version is 3.8 or higher, we can delete all of this and move everything
Member

Hi Matt, is datasets going to drop Python 3.7 support given its upcoming EOL? That happens at the end of this month, so we could wait and then set the minimum version to 3.8, although I assume some users may still be on 3.7.

Member

It will probably depend on what transformers does.

@alvarobartt
Member

LGTM @Rocketknight1! I may run some tests over the weekend to compare performance against the current approach, in case that's useful 😄

Member

@lhoestq left a comment

lgtm :) feel free to run some tests before merging though

@Rocketknight1
Member Author

@alvarobartt I'll probably merge now, just to avoid the major memory usage issues we currently have! Feel free to run the comparisons before/after the commit.

@Rocketknight1
Member Author

And yes, hopefully Py3.7 goes EOL and we make Py3.8 the minimum soon to resolve this.

@Rocketknight1 merged commit 6ee61e6 into main on Jun 8, 2023
@Rocketknight1 deleted the reduce_to_tf_dataset_memory_usage branch on June 8, 2023 16:32
@alvarobartt
Member

> @alvarobartt I'll probably merge now, just to avoid the major memory usage issues we currently have! Feel free to run the comparisons before/after the commit.

I'll ping you back with the comparison this weekend! 🤗

@github-actions
github-actions bot commented Jun 8, 2023

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.007102 / 0.011353 (-0.004251) 0.004713 / 0.011008 (-0.006295) 0.102391 / 0.038508 (0.063883) 0.038363 / 0.023109 (0.015253) 0.330843 / 0.275898 (0.054945) 0.365290 / 0.323480 (0.041810) 0.006389 / 0.007986 (-0.001596) 0.004287 / 0.004328 (-0.000041) 0.078710 / 0.004250 (0.074460) 0.051974 / 0.037052 (0.014922) 0.333163 / 0.258489 (0.074674) 0.371016 / 0.293841 (0.077176) 0.028412 / 0.128546 (-0.100134) 0.009350 / 0.075646 (-0.066296) 0.351673 / 0.419271 (-0.067599) 0.051879 / 0.043533 (0.008347) 0.323769 / 0.255139 (0.068630) 0.342994 / 0.283200 (0.059794) 0.107347 / 0.141683 (-0.034336) 1.585641 / 1.452155 (0.133487) 1.679408 / 1.492716 (0.186691)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.251772 / 0.018006 (0.233766) 0.580570 / 0.000490 (0.580081) 0.008346 / 0.000200 (0.008147) 0.000113 / 0.000054 (0.000059)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.028740 / 0.037411 (-0.008672) 0.117707 / 0.014526 (0.103182) 0.126397 / 0.176557 (-0.050160) 0.183823 / 0.737135 (-0.553312) 0.132272 / 0.296338 (-0.164066)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.428428 / 0.215209 (0.213219) 4.263983 / 2.077655 (2.186329) 2.012477 / 1.504120 (0.508357) 1.812453 / 1.541195 (0.271259) 1.889282 / 1.468490 (0.420792) 0.534459 / 4.584777 (-4.050318) 3.719460 / 3.745712 (-0.026252) 1.958039 / 5.269862 (-3.311823) 1.078166 / 4.565676 (-3.487510) 0.067902 / 0.424275 (-0.356373) 0.012479 / 0.007607 (0.004872) 0.532071 / 0.226044 (0.306026) 5.343323 / 2.268929 (3.074394) 2.478577 / 55.444624 (-52.966047) 2.146067 / 6.876477 (-4.730409) 2.324783 / 2.142072 (0.182710) 0.655925 / 4.805227 (-4.149302) 0.145578 / 6.500664 (-6.355086) 0.068044 / 0.075469 (-0.007425)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.254036 / 1.841788 (-0.587752) 15.199639 / 8.074308 (7.125331) 13.851406 / 10.191392 (3.660014) 0.168760 / 0.680424 (-0.511664) 0.017807 / 0.534201 (-0.516394) 0.425857 / 0.579283 (-0.153426) 0.413098 / 0.434364 (-0.021266) 0.497433 / 0.540337 (-0.042905) 0.599273 / 1.386936 (-0.787663)
PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.007044 / 0.011353 (-0.004309) 0.005036 / 0.011008 (-0.005972) 0.080307 / 0.038508 (0.041798) 0.035926 / 0.023109 (0.012817) 0.402026 / 0.275898 (0.126128) 0.444185 / 0.323480 (0.120705) 0.006228 / 0.007986 (-0.001758) 0.004481 / 0.004328 (0.000153) 0.080223 / 0.004250 (0.075972) 0.055385 / 0.037052 (0.018333) 0.405674 / 0.258489 (0.147184) 0.461574 / 0.293841 (0.167733) 0.029237 / 0.128546 (-0.099309) 0.009249 / 0.075646 (-0.066398) 0.086215 / 0.419271 (-0.333056) 0.048512 / 0.043533 (0.004979) 0.401374 / 0.255139 (0.146235) 0.418274 / 0.283200 (0.135074) 0.107994 / 0.141683 (-0.033689) 1.560504 / 1.452155 (0.108350) 1.669651 / 1.492716 (0.176935)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.275393 / 0.018006 (0.257387) 0.573688 / 0.000490 (0.573199) 0.007236 / 0.000200 (0.007036) 0.000153 / 0.000054 (0.000099)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.033387 / 0.037411 (-0.004024) 0.125027 / 0.014526 (0.110501) 0.138601 / 0.176557 (-0.037956) 0.191820 / 0.737135 (-0.545315) 0.141022 / 0.296338 (-0.155317)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.456382 / 0.215209 (0.241173) 4.559197 / 2.077655 (2.481542) 2.263333 / 1.504120 (0.759213) 2.073151 / 1.541195 (0.531956) 2.185314 / 1.468490 (0.716824) 0.540230 / 4.584777 (-4.044547) 3.934984 / 3.745712 (0.189272) 1.980895 / 5.269862 (-3.288966) 1.101440 / 4.565676 (-3.464237) 0.068255 / 0.424275 (-0.356020) 0.012605 / 0.007607 (0.004997) 0.560695 / 0.226044 (0.334650) 5.588877 / 2.268929 (3.319948) 2.756690 / 55.444624 (-52.687935) 2.427774 / 6.876477 (-4.448702) 2.548903 / 2.142072 (0.406831) 0.657177 / 4.805227 (-4.148050) 0.147645 / 6.500664 (-6.353019) 0.069216 / 0.075469 (-0.006253)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.319813 / 1.841788 (-0.521975) 15.882227 / 8.074308 (7.807919) 15.324481 / 10.191392 (5.133089) 0.193708 / 0.680424 (-0.486716) 0.018264 / 0.534201 (-0.515937) 0.432594 / 0.579283 (-0.146689) 0.437063 / 0.434364 (0.002699) 0.512297 / 0.540337 (-0.028040) 0.617469 / 1.386936 (-0.769467)

Development

Successfully merging this pull request may close these issues.

to_tf_dataset consumes too much memory
5 participants