Prepare tests for hfh 0.14 #5788

Wauplin · 2023-04-24T12:13:03Z

Related to the upcoming release of huggingface_hub==0.14.0, which will break some internal tests. This PR fixes those tests. Let's double-check the CI, but I expect the fixed tests to pass with both hfh<=0.13.4 and hfh==0.14. Worst-case scenario, existing PRs will have to be rebased once this fix is merged.
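For illustration only, here is a minimal sketch of the kind of version gate that lets one test suite cover both hfh<=0.13.4 and hfh==0.14. The thread does not say which behaviors changed in 0.14 or how the tests were actually fixed, so the helper names (`release_tuple`, `is_at_least`, `installed_hfh_version`) are hypothetical and only the comparison logic is shown.

```python
import re
from importlib import metadata


def release_tuple(version_string: str) -> tuple:
    """Return the leading numeric components, padded to three parts.

    Pre-release suffixes are ignored on purpose, so '0.14.0rc1' is treated
    like '0.14.0' (an assumption that suits testing a release candidate).
    """
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"not a version string: {version_string!r}")
    parts = [int(p) for p in match.group(0).split(".")]
    parts += [0] * (3 - len(parts))  # '0.14' -> (0, 14, 0)
    return tuple(parts)


def is_at_least(installed: str, minimum: str) -> bool:
    """True if `installed` is the same release as `minimum` or newer."""
    return release_tuple(installed) >= release_tuple(minimum)


def installed_hfh_version() -> str:
    """Hypothetical helper: look up the installed huggingface_hub version."""
    return metadata.version("huggingface_hub")


if __name__ == "__main__":
    for candidate in ("0.13.4", "0.14.0rc1", "0.14.0"):
        print(candidate, is_at_least(candidate, "0.14.0"))
```

A test could branch on `is_at_least(installed_hfh_version(), "0.14.0")` to pick the expectation that matches the installed release, which keeps the same test file valid on both sides of the upgrade.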

See related discussion (private slack).

cc @lhoestq

HuggingFaceDocBuilderDev · 2023-04-24T12:17:59Z

The documentation is not available anymore as the PR was closed or merged.

github-actions · 2023-04-24T12:18:18Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007343 / 0.011353 (-0.004010)	0.005145 / 0.011008 (-0.005863)	0.099820 / 0.038508 (0.061312)	0.033487 / 0.023109 (0.010378)	0.313069 / 0.275898 (0.037171)	0.335420 / 0.323480 (0.011940)	0.005959 / 0.007986 (-0.002027)	0.005373 / 0.004328 (0.001044)	0.076568 / 0.004250 (0.072317)	0.048702 / 0.037052 (0.011650)	0.322957 / 0.258489 (0.064468)	0.363044 / 0.293841 (0.069203)	0.035070 / 0.128546 (-0.093476)	0.012029 / 0.075646 (-0.063618)	0.334664 / 0.419271 (-0.084607)	0.050549 / 0.043533 (0.007017)	0.310113 / 0.255139 (0.054974)	0.324405 / 0.283200 (0.041205)	0.097596 / 0.141683 (-0.044087)	1.440741 / 1.452155 (-0.011414)	1.531194 / 1.492716 (0.038478)
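Each cell in these benchmark tables has the form `new / old (diff)`, where the value in parentheses is `new - old`. The sketch below parses one cell and checks that arithmetic; `parse_cell` is a hypothetical helper, not part of the benchmark tooling, and the thread does not state the unit of the measurements.

```python
import re

# One cell looks like "0.007343 / 0.011353 (-0.004010)".
CELL = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_cell(cell: str) -> tuple:
    """Split a 'new / old (diff)' cell into three floats."""
    match = CELL.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


if __name__ == "__main__":
    new, old, diff = parse_cell("0.007343 / 0.011353 (-0.004010)")
    # The reported diff should equal new - old up to rounding.
    print(abs((new - old) - diff) < 1e-6)
```

A negative diff means the new measurement is smaller than the old one, and a positive diff means it is larger.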

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.220799 / 0.018006 (0.202793)	0.438158 / 0.000490 (0.437668)	0.007737 / 0.000200 (0.007537)	0.000082 / 0.000054 (0.000027)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.026888 / 0.037411 (-0.010523)	0.106281 / 0.014526 (0.091755)	0.117419 / 0.176557 (-0.059138)	0.179144 / 0.737135 (-0.557992)	0.122477 / 0.296338 (-0.173861)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.412667 / 0.215209 (0.197458)	4.108784 / 2.077655 (2.031129)	1.834300 / 1.504120 (0.330180)	1.627256 / 1.541195 (0.086061)	1.691036 / 1.468490 (0.222546)	0.713405 / 4.584777 (-3.871372)	3.839262 / 3.745712 (0.093550)	2.108453 / 5.269862 (-3.161408)	1.340740 / 4.565676 (-3.224936)	0.087776 / 0.424275 (-0.336499)	0.012730 / 0.007607 (0.005123)	0.505323 / 0.226044 (0.279279)	5.085176 / 2.268929 (2.816247)	2.307165 / 55.444624 (-53.137459)	1.936771 / 6.876477 (-4.939706)	2.097391 / 2.142072 (-0.044681)	0.856215 / 4.805227 (-3.949012)	0.171826 / 6.500664 (-6.328838)	0.066603 / 0.075469 (-0.008866)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.202126 / 1.841788 (-0.639661)	15.173598 / 8.074308 (7.099290)	15.012645 / 10.191392 (4.821253)	0.162187 / 0.680424 (-0.518237)	0.017462 / 0.534201 (-0.516739)	0.423895 / 0.579283 (-0.155388)	0.432010 / 0.434364 (-0.002354)	0.503234 / 0.540337 (-0.037104)	0.598948 / 1.386936 (-0.787988)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007099 / 0.011353 (-0.004254)	0.005167 / 0.011008 (-0.005841)	0.075551 / 0.038508 (0.037043)	0.033050 / 0.023109 (0.009940)	0.339629 / 0.275898 (0.063731)	0.380486 / 0.323480 (0.057006)	0.005776 / 0.007986 (-0.002209)	0.004029 / 0.004328 (-0.000299)	0.075074 / 0.004250 (0.070823)	0.046709 / 0.037052 (0.009656)	0.340203 / 0.258489 (0.081714)	0.380849 / 0.293841 (0.087008)	0.035027 / 0.128546 (-0.093519)	0.012226 / 0.075646 (-0.063420)	0.087525 / 0.419271 (-0.331747)	0.049361 / 0.043533 (0.005828)	0.341854 / 0.255139 (0.086715)	0.359590 / 0.283200 (0.076390)	0.100102 / 0.141683 (-0.041581)	1.482759 / 1.452155 (0.030605)	1.569905 / 1.492716 (0.077189)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.213615 / 0.018006 (0.195609)	0.441117 / 0.000490 (0.440628)	0.004932 / 0.000200 (0.004732)	0.000093 / 0.000054 (0.000038)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.031313 / 0.037411 (-0.006098)	0.110191 / 0.014526 (0.095665)	0.125320 / 0.176557 (-0.051237)	0.177658 / 0.737135 (-0.559477)	0.127928 / 0.296338 (-0.168410)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.426952 / 0.215209 (0.211743)	4.247731 / 2.077655 (2.170076)	2.107318 / 1.504120 (0.603198)	1.843845 / 1.541195 (0.302650)	1.894822 / 1.468490 (0.426332)	0.696232 / 4.584777 (-3.888545)	3.826516 / 3.745712 (0.080804)	2.126688 / 5.269862 (-3.143174)	1.327062 / 4.565676 (-3.238615)	0.085693 / 0.424275 (-0.338582)	0.012226 / 0.007607 (0.004619)	0.521904 / 0.226044 (0.295859)	5.219798 / 2.268929 (2.950869)	2.524908 / 55.444624 (-52.919716)	2.212078 / 6.876477 (-4.664399)	2.373944 / 2.142072 (0.231871)	0.833846 / 4.805227 (-3.971381)	0.169639 / 6.500664 (-6.331025)	0.064538 / 0.075469 (-0.010931)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.254930 / 1.841788 (-0.586858)	15.585277 / 8.074308 (7.510969)	14.762857 / 10.191392 (4.571465)	0.146959 / 0.680424 (-0.533465)	0.017451 / 0.534201 (-0.516750)	0.424469 / 0.579283 (-0.154814)	0.422359 / 0.434364 (-0.012004)	0.489930 / 0.540337 (-0.050408)	0.595856 / 1.386936 (-0.791080)

albertvillanova

Thanks for taking care of the fixes in our CI.

albertvillanova · 2023-04-25T07:55:01Z

.github/workflows/ci.yml

@@ -7,6 +7,7 @@ on:
  push:
    branches:
      - main
+      - test-hfh*


I think this can be removed.

albertvillanova

EDIT:
~~Maybe we should first be sure it works fine with hfh<0.14.~~

Wauplin · 2023-04-25T08:17:29Z

@albertvillanova thanks for the review. Whatever you prefer for the GitHub CI config. I just took it from @lhoestq's branch when testing hfh==0.14.0. I think it's still relevant for future releases. In any case, I'll let you handle merging the PR :)
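To make the effect of the `test-hfh*` push filter concrete, here is a rough sketch of which branch names it would match. It assumes the GitHub Actions rule that a single `*` matches any run of characters except `/`; the functions `branch_filter_to_regex` and `push_triggers_ci` are hypothetical and handle only the single-`*` case, not `**` or other filter syntax.

```python
import re


def branch_filter_to_regex(pattern: str) -> re.Pattern:
    """Translate a branch filter where '*' means 'anything except /'.

    Only the single '*' wildcard is handled in this sketch.
    """
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile("^" + "[^/]*".join(literal_parts) + "$")


def push_triggers_ci(branch: str, patterns=("main", "test-hfh*")) -> bool:
    """True if a push to `branch` matches any of the configured filters."""
    return any(branch_filter_to_regex(p).match(branch) for p in patterns)


if __name__ == "__main__":
    for name in ("main", "test-hfh-rc-0.14", "prepare-tests-for-hfh-0.14"):
        print(name, push_triggers_ci(name))
```

Under these assumptions, a push to `test-hfh-rc-0.14` would run the CI without an open PR, while this PR's own branch, `prepare-tests-for-hfh-0.14`, would not match the filter.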

github-actions · 2023-04-25T14:06:08Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008371 / 0.011353 (-0.002982)	0.005210 / 0.011008 (-0.005798)	0.105639 / 0.038508 (0.067131)	0.045903 / 0.023109 (0.022794)	0.391231 / 0.275898 (0.115333)	0.438824 / 0.323480 (0.115345)	0.006270 / 0.007986 (-0.001715)	0.005950 / 0.004328 (0.001621)	0.079685 / 0.004250 (0.075434)	0.052121 / 0.037052 (0.015069)	0.387787 / 0.258489 (0.129298)	0.434322 / 0.293841 (0.140481)	0.032598 / 0.128546 (-0.095948)	0.012126 / 0.075646 (-0.063520)	0.359658 / 0.419271 (-0.059613)	0.046686 / 0.043533 (0.003154)	0.391973 / 0.255139 (0.136834)	0.421149 / 0.283200 (0.137949)	0.105920 / 0.141683 (-0.035763)	1.483008 / 1.452155 (0.030854)	1.617010 / 1.492716 (0.124294)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.199111 / 0.018006 (0.181105)	0.407995 / 0.000490 (0.407505)	0.006706 / 0.000200 (0.006506)	0.000229 / 0.000054 (0.000175)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030247 / 0.037411 (-0.007164)	0.115977 / 0.014526 (0.101451)	0.118112 / 0.176557 (-0.058444)	0.182710 / 0.737135 (-0.554426)	0.122483 / 0.296338 (-0.173855)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.430455 / 0.215209 (0.215246)	4.314298 / 2.077655 (2.236643)	1.898124 / 1.504120 (0.394005)	1.734909 / 1.541195 (0.193715)	1.802400 / 1.468490 (0.333910)	0.717237 / 4.584777 (-3.867539)	4.004705 / 3.745712 (0.258993)	2.138901 / 5.269862 (-3.130960)	1.254037 / 4.565676 (-3.311640)	0.085594 / 0.424275 (-0.338681)	0.013774 / 0.007607 (0.006166)	0.535218 / 0.226044 (0.309174)	5.373730 / 2.268929 (3.104801)	2.371194 / 55.444624 (-53.073430)	2.111206 / 6.876477 (-4.765270)	2.225137 / 2.142072 (0.083064)	0.838325 / 4.805227 (-3.966902)	0.159176 / 6.500664 (-6.341488)	0.072285 / 0.075469 (-0.003184)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.352232 / 1.841788 (-0.489555)	16.926722 / 8.074308 (8.852414)	16.709531 / 10.191392 (6.518139)	0.159249 / 0.680424 (-0.521175)	0.017667 / 0.534201 (-0.516534)	0.426894 / 0.579283 (-0.152390)	0.539903 / 0.434364 (0.105539)	0.537471 / 0.540337 (-0.002866)	0.619592 / 1.386936 (-0.767344)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008354 / 0.011353 (-0.002999)	0.005366 / 0.011008 (-0.005642)	0.080961 / 0.038508 (0.042453)	0.046574 / 0.023109 (0.023465)	0.345949 / 0.275898 (0.070051)	0.394041 / 0.323480 (0.070562)	0.006209 / 0.007986 (-0.001777)	0.005980 / 0.004328 (0.001651)	0.076235 / 0.004250 (0.071984)	0.051833 / 0.037052 (0.014780)	0.348786 / 0.258489 (0.090297)	0.397421 / 0.293841 (0.103580)	0.033026 / 0.128546 (-0.095520)	0.012217 / 0.075646 (-0.063429)	0.087439 / 0.419271 (-0.331832)	0.045488 / 0.043533 (0.001955)	0.352160 / 0.255139 (0.097021)	0.379079 / 0.283200 (0.095879)	0.116111 / 0.141683 (-0.025572)	1.470177 / 1.452155 (0.018022)	1.587499 / 1.492716 (0.094783)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.296149 / 0.018006 (0.278143)	0.592362 / 0.000490 (0.591872)	0.000492 / 0.000200 (0.000292)	0.000064 / 0.000054 (0.000010)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.036599 / 0.037411 (-0.000813)	0.113768 / 0.014526 (0.099242)	0.116198 / 0.176557 (-0.060358)	0.180329 / 0.737135 (-0.556806)	0.123942 / 0.296338 (-0.172396)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.452445 / 0.215209 (0.237236)	4.504330 / 2.077655 (2.426675)	2.275645 / 1.504120 (0.771525)	2.107765 / 1.541195 (0.566571)	2.086363 / 1.468490 (0.617873)	0.723721 / 4.584777 (-3.861056)	3.825330 / 3.745712 (0.079618)	2.162743 / 5.269862 (-3.107119)	1.255953 / 4.565676 (-3.309724)	0.085860 / 0.424275 (-0.338415)	0.013790 / 0.007607 (0.006183)	0.560257 / 0.226044 (0.334213)	5.618180 / 2.268929 (3.349251)	2.625423 / 55.444624 (-52.819202)	2.374381 / 6.876477 (-4.502095)	2.496560 / 2.142072 (0.354488)	0.841120 / 4.805227 (-3.964107)	0.161541 / 6.500664 (-6.339123)	0.075270 / 0.075469 (-0.000199)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.432916 / 1.841788 (-0.408872)	14.858534 / 8.074308 (6.784226)	14.973521 / 10.191392 (4.782129)	0.148312 / 0.680424 (-0.532112)	0.016811 / 0.534201 (-0.517390)	0.382623 / 0.579283 (-0.196660)	0.389767 / 0.434364 (-0.044596)	0.449657 / 0.540337 (-0.090680)	0.533723 / 1.386936 (-0.853214)

albertvillanova · 2023-04-25T14:25:15Z

I agree it is good to have a way to run the CI on push, without needing to open a PR.

But I think the branch name should be more generic (and this is not specific to this PR). See:

Allow to run CI on push to ci-branch #5790

github-actions · 2023-04-25T14:32:56Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007208 / 0.011353 (-0.004145)	0.005600 / 0.011008 (-0.005408)	0.096129 / 0.038508 (0.057621)	0.027834 / 0.023109 (0.004725)	0.295106 / 0.275898 (0.019208)	0.323983 / 0.323480 (0.000503)	0.005164 / 0.007986 (-0.002822)	0.003962 / 0.004328 (-0.000366)	0.078339 / 0.004250 (0.074089)	0.036974 / 0.037052 (-0.000078)	0.310315 / 0.258489 (0.051826)	0.338036 / 0.293841 (0.044195)	0.042124 / 0.128546 (-0.086422)	0.015886 / 0.075646 (-0.059760)	0.337961 / 0.419271 (-0.081310)	0.051507 / 0.043533 (0.007974)	0.297505 / 0.255139 (0.042366)	0.310728 / 0.283200 (0.027528)	0.086312 / 0.141683 (-0.055371)	1.356923 / 1.452155 (-0.095232)	1.429366 / 1.492716 (-0.063350)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.205495 / 0.018006 (0.187489)	0.460639 / 0.000490 (0.460149)	0.003996 / 0.000200 (0.003796)	0.000093 / 0.000054 (0.000038)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.021970 / 0.037411 (-0.015442)	0.090283 / 0.014526 (0.075757)	0.098579 / 0.176557 (-0.077978)	0.160437 / 0.737135 (-0.576699)	0.102738 / 0.296338 (-0.193600)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.494474 / 0.215209 (0.279265)	4.967453 / 2.077655 (2.889799)	2.045852 / 1.504120 (0.541732)	1.858022 / 1.541195 (0.316827)	1.771874 / 1.468490 (0.303384)	1.186368 / 4.584777 (-3.398408)	4.974762 / 3.745712 (1.229050)	2.616225 / 5.269862 (-2.653636)	1.702971 / 4.565676 (-2.862705)	0.124929 / 0.424275 (-0.299346)	0.011774 / 0.007607 (0.004167)	0.569643 / 0.226044 (0.343598)	5.793114 / 2.268929 (3.524186)	2.441561 / 55.444624 (-53.003064)	1.862233 / 6.876477 (-5.014243)	1.931142 / 2.142072 (-0.210931)	1.148915 / 4.805227 (-3.656313)	0.203914 / 6.500664 (-6.296750)	0.062468 / 0.075469 (-0.013001)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.188708 / 1.841788 (-0.653080)	13.710830 / 8.074308 (5.636522)	15.695153 / 10.191392 (5.503761)	0.171467 / 0.680424 (-0.508957)	0.024509 / 0.534201 (-0.509692)	0.450270 / 0.579283 (-0.129014)	0.500712 / 0.434364 (0.066348)	0.488632 / 0.540337 (-0.051706)	0.574893 / 1.386936 (-0.812043)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007254 / 0.011353 (-0.004099)	0.006199 / 0.011008 (-0.004809)	0.072079 / 0.038508 (0.033571)	0.026909 / 0.023109 (0.003800)	0.355538 / 0.275898 (0.079640)	0.358625 / 0.323480 (0.035145)	0.005564 / 0.007986 (-0.002421)	0.005278 / 0.004328 (0.000950)	0.076469 / 0.004250 (0.072219)	0.038269 / 0.037052 (0.001216)	0.355214 / 0.258489 (0.096725)	0.383219 / 0.293841 (0.089378)	0.046516 / 0.128546 (-0.082030)	0.015393 / 0.075646 (-0.060254)	0.088506 / 0.419271 (-0.330765)	0.050326 / 0.043533 (0.006793)	0.327265 / 0.255139 (0.072126)	0.370176 / 0.283200 (0.086976)	0.102438 / 0.141683 (-0.039245)	1.378969 / 1.452155 (-0.073186)	1.441998 / 1.492716 (-0.050719)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.209044 / 0.018006 (0.191038)	0.455733 / 0.000490 (0.455243)	0.005856 / 0.000200 (0.005656)	0.000116 / 0.000054 (0.000061)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.025336 / 0.037411 (-0.012075)	0.097449 / 0.014526 (0.082923)	0.106301 / 0.176557 (-0.070255)	0.153053 / 0.737135 (-0.584082)	0.107938 / 0.296338 (-0.188401)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.491070 / 0.215209 (0.275861)	5.049637 / 2.077655 (2.971982)	2.064709 / 1.504120 (0.560589)	1.782266 / 1.541195 (0.241072)	1.798570 / 1.468490 (0.330080)	0.988886 / 4.584777 (-3.595891)	4.690324 / 3.745712 (0.944612)	4.317355 / 5.269862 (-0.952507)	2.347596 / 4.565676 (-2.218081)	0.117249 / 0.424275 (-0.307026)	0.011614 / 0.007607 (0.004007)	0.630033 / 0.226044 (0.403988)	6.140108 / 2.268929 (3.871180)	2.638080 / 55.444624 (-52.806545)	2.133017 / 6.876477 (-4.743459)	2.123392 / 2.142072 (-0.018680)	1.178056 / 4.805227 (-3.627171)	0.209465 / 6.500664 (-6.291199)	0.063234 / 0.075469 (-0.012235)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.238089 / 1.841788 (-0.603699)	14.066866 / 8.074308 (5.992558)	16.225480 / 10.191392 (6.034088)	0.206466 / 0.680424 (-0.473958)	0.027279 / 0.534201 (-0.506922)	0.443006 / 0.579283 (-0.136277)	0.509512 / 0.434364 (0.075148)	0.479075 / 0.540337 (-0.061263)	0.573546 / 1.386936 (-0.813390)

lhoestq and others added 6 commits April 20, 2023 15:09

test hfh 0.14.0rc1

1f14984

fix style

becb31e

fix tests for hfh>=0.14

0c51b8e

Merge branch 'main' into test-hfh-rc-0.14

515196a

fix test token

a033fb2

package version

213c72f

albertvillanova approved these changes Apr 25, 2023

albertvillanova requested changes Apr 25, 2023

albertvillanova approved these changes Apr 25, 2023

albertvillanova mentioned this pull request Apr 25, 2023

Allow to run CI on push to ci-branch #5790

Merged

Remove CI run on push to test-hfh* branch

f834435

albertvillanova merged commit c6015a0 into main Apr 25, 2023

albertvillanova deleted the prepare-tests-for-hfh-0.14 branch April 25, 2023 14:25
