Fix streaming parquet with image feature in schema #5921

lhoestq · 2023-06-01T15:23:10Z

When streaming Parquet files, the feature types were not being read from the Arrow schema stored in the Parquet file.
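
For context, here is a minimal sketch of the idea. It assumes that `datasets` stores the feature types as JSON under a `huggingface` key in the Arrow schema metadata. The helper name and the sample metadata are hypothetical, and a plain dict stands in for the real schema metadata so the snippet runs without `pyarrow`. It is not the code changed in this PR.

```python
import json

# Hypothetical stand-in for the metadata mapping of an Arrow schema
# (bytes keys, bytes values). The b"huggingface" key and the
# "info" -> "features" layout are assumptions of this sketch.
schema_metadata = {
    b"huggingface": json.dumps(
        {
            "info": {
                "features": {
                    "image": {"_type": "Image"},
                    "label": {"dtype": "int64", "_type": "Value"},
                }
            }
        }
    ).encode("utf-8")
}


def features_from_schema_metadata(metadata):
    """Return the stored feature dict, or None if the schema carries none."""
    if not metadata or b"huggingface" not in metadata:
        return None
    payload = json.loads(metadata[b"huggingface"].decode("utf-8"))
    return payload.get("info", {}).get("features")


features = features_from_schema_metadata(schema_metadata)
print(features["image"]["_type"])  # Image
```

If this metadata is ignored, the `image` column would presumably be treated as a plain struct of bytes and path instead of an image feature.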

HuggingFaceDocBuilderDev · 2023-06-01T15:27:50Z

The documentation is not available anymore as the PR was closed or merged.

github-actions · 2023-06-01T15:29:16Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007088 / 0.011353 (-0.004265)	0.005216 / 0.011008 (-0.005793)	0.097572 / 0.038508 (0.059064)	0.036510 / 0.023109 (0.013401)	0.316885 / 0.275898 (0.040987)	0.348541 / 0.323480 (0.025061)	0.006513 / 0.007986 (-0.001473)	0.004579 / 0.004328 (0.000251)	0.073779 / 0.004250 (0.069529)	0.057500 / 0.037052 (0.020448)	0.329840 / 0.258489 (0.071351)	0.357530 / 0.293841 (0.063690)	0.028515 / 0.128546 (-0.100031)	0.009156 / 0.075646 (-0.066491)	0.328340 / 0.419271 (-0.090932)	0.068400 / 0.043533 (0.024867)	0.313692 / 0.255139 (0.058553)	0.329170 / 0.283200 (0.045971)	0.111969 / 0.141683 (-0.029714)	1.422096 / 1.452155 (-0.030059)	1.550042 / 1.492716 (0.057326)
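
Each cell in these tables follows the row label `new / old (diff)`. As a reading aid, the sketch below parses one cell and checks that the reported diff equals new minus old, up to rounding of the printed digits. The function name `parse_cell` is hypothetical and not part of the benchmark tooling.

```python
import re

# Pattern for one cell such as "0.007088 / 0.011353 (-0.004265)".
CELL = re.compile(r"^\s*([\d.]+)\s*/\s*([\d.]+)\s*\((-?[\d.]+)\)\s*$")


def parse_cell(cell):
    """Split a 'new / old (diff)' cell into three floats."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


new, old, diff = parse_cell("0.007088 / 0.011353 (-0.004265)")
# The reported diff is new minus old; allow for rounding in the last digit.
assert abs((new - old) - diff) < 2e-6
```

A negative diff therefore means the new value is smaller than the old reference value for that metric.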

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.285113 / 0.018006 (0.267107)	0.546788 / 0.000490 (0.546298)	0.006992 / 0.000200 (0.006792)	0.000097 / 0.000054 (0.000043)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.026841 / 0.037411 (-0.010570)	0.108413 / 0.014526 (0.093887)	0.118375 / 0.176557 (-0.058181)	0.174889 / 0.737135 (-0.562246)	0.122781 / 0.296338 (-0.173558)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.404187 / 0.215209 (0.188978)	4.039673 / 2.077655 (1.962019)	1.894616 / 1.504120 (0.390496)	1.729182 / 1.541195 (0.187987)	1.772917 / 1.468490 (0.304427)	0.524046 / 4.584777 (-4.060731)	3.628111 / 3.745712 (-0.117601)	1.866075 / 5.269862 (-3.403787)	1.026435 / 4.565676 (-3.539242)	0.065328 / 0.424275 (-0.358947)	0.012717 / 0.007607 (0.005110)	0.505821 / 0.226044 (0.279777)	5.049518 / 2.268929 (2.780589)	2.338486 / 55.444624 (-53.106139)	2.002874 / 6.876477 (-4.873602)	2.193049 / 2.142072 (0.050976)	0.664638 / 4.805227 (-4.140589)	0.151323 / 6.500664 (-6.349341)	0.063774 / 0.075469 (-0.011695)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.168168 / 1.841788 (-0.673620)	15.289200 / 8.074308 (7.214891)	13.614249 / 10.191392 (3.422857)	0.167950 / 0.680424 (-0.512474)	0.017522 / 0.534201 (-0.516679)	0.393480 / 0.579283 (-0.185803)	0.420549 / 0.434364 (-0.013815)	0.461425 / 0.540337 (-0.078912)	0.563583 / 1.386936 (-0.823353)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006859 / 0.011353 (-0.004493)	0.004864 / 0.011008 (-0.006144)	0.075084 / 0.038508 (0.036576)	0.033989 / 0.023109 (0.010880)	0.372512 / 0.275898 (0.096614)	0.394725 / 0.323480 (0.071246)	0.006382 / 0.007986 (-0.001604)	0.004521 / 0.004328 (0.000193)	0.076422 / 0.004250 (0.072172)	0.055383 / 0.037052 (0.018331)	0.400974 / 0.258489 (0.142485)	0.411570 / 0.293841 (0.117729)	0.028264 / 0.128546 (-0.100282)	0.009123 / 0.075646 (-0.066523)	0.081257 / 0.419271 (-0.338015)	0.048147 / 0.043533 (0.004614)	0.390735 / 0.255139 (0.135596)	0.376426 / 0.283200 (0.093226)	0.108164 / 0.141683 (-0.033518)	1.429667 / 1.452155 (-0.022488)	1.556291 / 1.492716 (0.063575)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.289514 / 0.018006 (0.271508)	0.532860 / 0.000490 (0.532370)	0.003810 / 0.000200 (0.003611)	0.000121 / 0.000054 (0.000066)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.031292 / 0.037411 (-0.006119)	0.116530 / 0.014526 (0.102005)	0.127624 / 0.176557 (-0.048932)	0.178276 / 0.737135 (-0.558859)	0.133742 / 0.296338 (-0.162597)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.431505 / 0.215209 (0.216296)	4.309206 / 2.077655 (2.231551)	2.174779 / 1.504120 (0.670659)	1.998122 / 1.541195 (0.456927)	2.126478 / 1.468490 (0.657988)	0.528971 / 4.584777 (-4.055806)	3.797608 / 3.745712 (0.051895)	1.876275 / 5.269862 (-3.393586)	1.087458 / 4.565676 (-3.478218)	0.066940 / 0.424275 (-0.357335)	0.012432 / 0.007607 (0.004825)	0.538346 / 0.226044 (0.312301)	5.370968 / 2.268929 (3.102039)	2.613718 / 55.444624 (-52.830906)	2.246585 / 6.876477 (-4.629892)	2.375695 / 2.142072 (0.233622)	0.652227 / 4.805227 (-4.153001)	0.143246 / 6.500664 (-6.357418)	0.066163 / 0.075469 (-0.009306)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.291263 / 1.841788 (-0.550524)	16.532281 / 8.074308 (8.457973)	15.038471 / 10.191392 (4.847079)	0.168139 / 0.680424 (-0.512285)	0.017724 / 0.534201 (-0.516477)	0.391636 / 0.579283 (-0.187648)	0.429690 / 0.434364 (-0.004674)	0.474941 / 0.540337 (-0.065396)	0.579461 / 1.386936 (-0.807475)

github-actions · 2023-06-01T15:30:09Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006083 / 0.011353 (-0.005269)	0.004085 / 0.011008 (-0.006923)	0.098337 / 0.038508 (0.059829)	0.027573 / 0.023109 (0.004464)	0.305688 / 0.275898 (0.029790)	0.341767 / 0.323480 (0.018287)	0.005143 / 0.007986 (-0.002842)	0.003396 / 0.004328 (-0.000932)	0.076925 / 0.004250 (0.072674)	0.041027 / 0.037052 (0.003975)	0.307877 / 0.258489 (0.049388)	0.346559 / 0.293841 (0.052718)	0.025183 / 0.128546 (-0.103363)	0.008575 / 0.075646 (-0.067071)	0.319449 / 0.419271 (-0.099823)	0.043378 / 0.043533 (-0.000154)	0.304563 / 0.255139 (0.049424)	0.332019 / 0.283200 (0.048819)	0.087725 / 0.141683 (-0.053958)	1.484904 / 1.452155 (0.032749)	1.582780 / 1.492716 (0.090064)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.197503 / 0.018006 (0.179497)	0.410370 / 0.000490 (0.409880)	0.003840 / 0.000200 (0.003640)	0.000067 / 0.000054 (0.000013)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.024179 / 0.037411 (-0.013232)	0.098876 / 0.014526 (0.084350)	0.106189 / 0.176557 (-0.070367)	0.168964 / 0.737135 (-0.568171)	0.109723 / 0.296338 (-0.186616)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.429453 / 0.215209 (0.214244)	4.295584 / 2.077655 (2.217929)	2.014330 / 1.504120 (0.510210)	1.841119 / 1.541195 (0.299924)	1.928378 / 1.468490 (0.459888)	0.554571 / 4.584777 (-4.030206)	3.431769 / 3.745712 (-0.313943)	1.716204 / 5.269862 (-3.553658)	0.995054 / 4.565676 (-3.570622)	0.067374 / 0.424275 (-0.356902)	0.012557 / 0.007607 (0.004950)	0.533785 / 0.226044 (0.307740)	5.363360 / 2.268929 (3.094431)	2.535190 / 55.444624 (-52.909434)	2.191646 / 6.876477 (-4.684831)	2.400799 / 2.142072 (0.258727)	0.663961 / 4.805227 (-4.141266)	0.135992 / 6.500664 (-6.364672)	0.067378 / 0.075469 (-0.008092)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.235110 / 1.841788 (-0.606678)	13.820695 / 8.074308 (5.746387)	13.667202 / 10.191392 (3.475810)	0.143025 / 0.680424 (-0.537399)	0.016757 / 0.534201 (-0.517444)	0.356262 / 0.579283 (-0.223021)	0.401871 / 0.434364 (-0.032493)	0.423928 / 0.540337 (-0.116410)	0.514598 / 1.386936 (-0.872338)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006260 / 0.011353 (-0.005093)	0.004159 / 0.011008 (-0.006850)	0.076780 / 0.038508 (0.038272)	0.027899 / 0.023109 (0.004789)	0.412756 / 0.275898 (0.136858)	0.455145 / 0.323480 (0.131665)	0.005029 / 0.007986 (-0.002956)	0.003482 / 0.004328 (-0.000847)	0.076148 / 0.004250 (0.071898)	0.038969 / 0.037052 (0.001917)	0.429975 / 0.258489 (0.171486)	0.465880 / 0.293841 (0.172039)	0.025555 / 0.128546 (-0.102991)	0.008612 / 0.075646 (-0.067034)	0.082604 / 0.419271 (-0.336667)	0.039690 / 0.043533 (-0.003842)	0.403644 / 0.255139 (0.148505)	0.440438 / 0.283200 (0.157238)	0.090984 / 0.141683 (-0.050699)	1.465915 / 1.452155 (0.013760)	1.564227 / 1.492716 (0.071511)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.010502 / 0.018006 (-0.007504)	0.410573 / 0.000490 (0.410083)	0.000384 / 0.000200 (0.000184)	0.000059 / 0.000054 (0.000004)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.025726 / 0.037411 (-0.011686)	0.101760 / 0.014526 (0.087235)	0.110102 / 0.176557 (-0.066454)	0.161321 / 0.737135 (-0.575815)	0.112507 / 0.296338 (-0.183832)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.469925 / 0.215209 (0.254716)	4.718740 / 2.077655 (2.641085)	2.466272 / 1.504120 (0.962152)	2.267357 / 1.541195 (0.726162)	2.331343 / 1.468490 (0.862853)	0.553448 / 4.584777 (-4.031329)	3.464228 / 3.745712 (-0.281484)	3.060957 / 5.269862 (-2.208905)	1.387261 / 4.565676 (-3.178415)	0.067989 / 0.424275 (-0.356286)	0.012349 / 0.007607 (0.004741)	0.575046 / 0.226044 (0.349001)	5.740322 / 2.268929 (3.471394)	2.925666 / 55.444624 (-52.518958)	2.606535 / 6.876477 (-4.269942)	2.658144 / 2.142072 (0.516072)	0.655157 / 4.805227 (-4.150071)	0.138520 / 6.500664 (-6.362144)	0.069442 / 0.075469 (-0.006027)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.306523 / 1.841788 (-0.535265)	14.400380 / 8.074308 (6.326072)	14.231519 / 10.191392 (4.040127)	0.146194 / 0.680424 (-0.534230)	0.016632 / 0.534201 (-0.517569)	0.361151 / 0.579283 (-0.218132)	0.388838 / 0.434364 (-0.045526)	0.419337 / 0.540337 (-0.121001)	0.500483 / 1.386936 (-0.886453)

albertvillanova

Good catch! Thanks.

github-actions · 2023-06-02T10:02:54Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009430 / 0.011353 (-0.001923)	0.006673 / 0.011008 (-0.004335)	0.125151 / 0.038508 (0.086643)	0.038258 / 0.023109 (0.015149)	0.426383 / 0.275898 (0.150485)	0.432327 / 0.323480 (0.108847)	0.006964 / 0.007986 (-0.001022)	0.005140 / 0.004328 (0.000811)	0.100767 / 0.004250 (0.096517)	0.058663 / 0.037052 (0.021610)	0.424709 / 0.258489 (0.166220)	0.453049 / 0.293841 (0.159208)	0.051042 / 0.128546 (-0.077505)	0.015291 / 0.075646 (-0.060355)	0.456549 / 0.419271 (0.037278)	0.067106 / 0.043533 (0.023573)	0.408959 / 0.255139 (0.153820)	0.445067 / 0.283200 (0.161867)	0.115590 / 0.141683 (-0.026092)	1.929439 / 1.452155 (0.477284)	2.045709 / 1.492716 (0.552992)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.250726 / 0.018006 (0.232720)	0.598976 / 0.000490 (0.598486)	0.007542 / 0.000200 (0.007342)	0.000101 / 0.000054 (0.000046)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030317 / 0.037411 (-0.007094)	0.133177 / 0.014526 (0.118651)	0.152761 / 0.176557 (-0.023795)	0.233708 / 0.737135 (-0.503428)	0.147303 / 0.296338 (-0.149036)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.633562 / 0.215209 (0.418353)	6.235021 / 2.077655 (4.157366)	2.652573 / 1.504120 (1.148454)	2.223363 / 1.541195 (0.682168)	2.231022 / 1.468490 (0.762531)	0.942218 / 4.584777 (-3.642559)	6.068661 / 3.745712 (2.322949)	2.778604 / 5.269862 (-2.491257)	1.787939 / 4.565676 (-2.777737)	0.117749 / 0.424275 (-0.306526)	0.015613 / 0.007607 (0.008006)	0.810222 / 0.226044 (0.584177)	7.931509 / 2.268929 (5.662581)	3.260679 / 55.444624 (-52.183945)	2.609085 / 6.876477 (-4.267391)	2.867838 / 2.142072 (0.725766)	1.144672 / 4.805227 (-3.660555)	0.224379 / 6.500664 (-6.276285)	0.084490 / 0.075469 (0.009021)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.650608 / 1.841788 (-0.191179)	18.919748 / 8.074308 (10.845440)	20.163162 / 10.191392 (9.971770)	0.229427 / 0.680424 (-0.450997)	0.033090 / 0.534201 (-0.501111)	0.535549 / 0.579283 (-0.043734)	0.658629 / 0.434364 (0.224265)	0.631526 / 0.540337 (0.091189)	0.748701 / 1.386936 (-0.638235)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009157 / 0.011353 (-0.002196)	0.006153 / 0.011008 (-0.004856)	0.106294 / 0.038508 (0.067786)	0.040947 / 0.023109 (0.017837)	0.493242 / 0.275898 (0.217344)	0.563525 / 0.323480 (0.240045)	0.007256 / 0.007986 (-0.000730)	0.006757 / 0.004328 (0.002429)	0.105151 / 0.004250 (0.100901)	0.056262 / 0.037052 (0.019209)	0.573341 / 0.258489 (0.314852)	0.591125 / 0.293841 (0.297284)	0.047935 / 0.128546 (-0.080611)	0.015385 / 0.075646 (-0.060262)	0.119457 / 0.419271 (-0.299814)	0.066510 / 0.043533 (0.022977)	0.485622 / 0.255139 (0.230483)	0.540929 / 0.283200 (0.257730)	0.132619 / 0.141683 (-0.009064)	1.916905 / 1.452155 (0.464750)	2.152722 / 1.492716 (0.660006)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.294823 / 0.018006 (0.276817)	0.569371 / 0.000490 (0.568882)	0.000642 / 0.000200 (0.000442)	0.000091 / 0.000054 (0.000036)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.034321 / 0.037411 (-0.003090)	0.134165 / 0.014526 (0.119639)	0.157871 / 0.176557 (-0.018685)	0.210753 / 0.737135 (-0.526382)	0.152961 / 0.296338 (-0.143377)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.686810 / 0.215209 (0.471601)	6.890432 / 2.077655 (4.812778)	3.182875 / 1.504120 (1.678755)	2.770836 / 1.541195 (1.229641)	2.790785 / 1.468490 (1.322295)	0.938145 / 4.584777 (-3.646632)	5.861093 / 3.745712 (2.115381)	2.719862 / 5.269862 (-2.550000)	1.760834 / 4.565676 (-2.804842)	0.111317 / 0.424275 (-0.312958)	0.015722 / 0.007607 (0.008115)	0.863032 / 0.226044 (0.636988)	8.482433 / 2.268929 (6.213504)	3.892621 / 55.444624 (-51.552003)	3.207370 / 6.876477 (-3.669106)	3.344412 / 2.142072 (1.202339)	1.133903 / 4.805227 (-3.671324)	0.223456 / 6.500664 (-6.277209)	0.084335 / 0.075469 (0.008866)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.794116 / 1.841788 (-0.047672)	19.077447 / 8.074308 (11.003139)	23.102309 / 10.191392 (12.910917)	0.268806 / 0.680424 (-0.411617)	0.027709 / 0.534201 (-0.506492)	0.540488 / 0.579283 (-0.038796)	0.658478 / 0.434364 (0.224114)	0.604769 / 0.540337 (0.064431)	0.722768 / 1.386936 (-0.664168)

lhoestq added 2 commits June 1, 2023 17:21

fix streaming parquet with image feature in schema

db690af

minor

c0429e9

lhoestq requested a review from mariosasko June 1, 2023 17:23

albertvillanova approved these changes Jun 2, 2023

lhoestq merged commit 7e52021 into main Jun 2, 2023

lhoestq deleted the fix-streaming-parquet-with-image-feature-in-schema branch June 2, 2023 09:53

This was referenced Jun 12, 2023

Image Encoding Issue when submitting a Parquet Dataset #5869

Closed

Update datasets to 2.13.0 huggingface/dataset-viewer#1370

Closed

severo mentioned this pull request Jun 15, 2023

Update datasets dependency to 2.13.0 version huggingface/dataset-viewer#1372

Merged
