Remove tasks #6999

albertvillanova · 2024-06-25T09:06:16Z

Remove tasks, as part of the 3.0 release.

HuggingFaceDocBuilderDev · 2024-06-25T13:42:39Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

lhoestq · 2024-06-27T15:22:29Z

src/datasets/info.py

@@ -146,7 +143,6 @@ class DatasetInfo:
    features: Optional[Features] = None
    post_processed: Optional[PostProcessedInfo] = None
    supervised_keys: Optional[SupervisedKeysData] = None
-    task_templates: Optional[List[TaskTemplate]] = None


Maybe we can keep this field (or ignore it in `__post_init__`), since removing it would break ILSVRC/imagenet-1k, ylecun/mnist, and many other datasets.

We could also keep the task classes so that they can still be imported and instantiated without errors, but remove all their methods, like align_feature, since they are unused.
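To make the "ignore it in a post_init" option concrete, here is a minimal sketch. It uses a toy dataclass, `ToyDatasetInfo`, which is a hypothetical stand-in and not the library's `DatasetInfo`; the warning category and the `from_dict` helper are likewise assumptions chosen for illustration.

```python
import warnings
from dataclasses import dataclass, fields
from typing import Any, Dict, List, Optional


@dataclass
class ToyDatasetInfo:
    """Hypothetical stand-in for DatasetInfo; not the library's class."""

    description: str = ""
    # Deprecated field, accepted only so that old metadata still loads.
    task_templates: Optional[List[Any]] = None

    def __post_init__(self) -> None:
        # Accept the deprecated value, warn, then discard it.
        if self.task_templates is not None:
            warnings.warn(
                "task_templates is deprecated and ignored",
                FutureWarning,
                stacklevel=2,
            )
            self.task_templates = None

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ToyDatasetInfo":
        # Drop keys that are not declared fields instead of raising TypeError.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})
```

With this shape, metadata that still carries `task_templates` loads without error, while nothing downstream can depend on the value.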

So we keep these deprecated classes and parameters for datasets 3.0...

Alternatively, we could just hardcode a patch in dataset-viewer to keep supporting imagenet and mnist until they are converted to no-code datasets?

Actually, breaking this would be a step in the right direction, since it pushes people to stop using code datasets. I'm just concerned that the mnist repo will stop working, but if ylecun doesn't merge the PR to convert the dataset to Parquet, we can still hardcode something for the viewer.

(and imagenet we can merge by ourselves)

I understand your concerns, but at the same time I would push not to keep the tasks in the new major version: they have been deprecated since version 2.13.0, more than a year ago.

So I would propose, before making the next release:

1. Identify all the datasets using tasks.
2. Open PRs to convert them to Parquet.
3. Wait 1-2 weeks for the owners to merge, before forcing the merge ourselves for "maintenance" reasons.
4. Only then, make the release.

I can work on that with a script.
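As a rough illustration of the first step, a script could flag loading scripts whose source mentions the tasks API. This is a sketch under stated assumptions, not the script that was actually used: the function names are hypothetical, the two patterns are a guess at what matters rather than a complete list, and fetching the script sources from the Hub is left out.

```python
import re

# Patterns suggesting that a loading script relies on the removed tasks API.
# This list is an assumption, not a complete inventory.
TASK_PATTERNS = [
    re.compile(r"\bdatasets\.tasks\b"),
    re.compile(r"\btask_templates\b"),
]


def uses_tasks(script_source: str) -> bool:
    """Return True if the script text matches any tasks-related pattern."""
    return any(p.search(script_source) for p in TASK_PATTERNS)


def find_scripts_using_tasks(scripts: dict) -> list:
    """Given {repo_id: script source}, return the sorted repo ids that use tasks."""
    return sorted(repo for repo, src in scripts.items() if uses_tasks(src))
```

A text match like this can produce false positives (for example, a mention inside a comment), so the flagged repositories would still need a manual check before opening conversion PRs.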

Sounds good, though there is no need to convert all the datasets, just the ones that are important.

github-actions · 2024-08-21T09:07:06Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005330 / 0.011353 (-0.006023)	0.003946 / 0.011008 (-0.007062)	0.063530 / 0.038508 (0.025022)	0.030529 / 0.023109 (0.007419)	0.239364 / 0.275898 (-0.036534)	0.261683 / 0.323480 (-0.061797)	0.003197 / 0.007986 (-0.004789)	0.003485 / 0.004328 (-0.000844)	0.049575 / 0.004250 (0.045325)	0.046164 / 0.037052 (0.009112)	0.246129 / 0.258489 (-0.012360)	0.281365 / 0.293841 (-0.012476)	0.029480 / 0.128546 (-0.099066)	0.012450 / 0.075646 (-0.063196)	0.203696 / 0.419271 (-0.215575)	0.036539 / 0.043533 (-0.006994)	0.241664 / 0.255139 (-0.013475)	0.260930 / 0.283200 (-0.022270)	0.019931 / 0.141683 (-0.121752)	1.221075 / 1.452155 (-0.231080)	1.246315 / 1.492716 (-0.246402)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.095061 / 0.018006 (0.077055)	0.304773 / 0.000490 (0.304283)	0.000208 / 0.000200 (0.000008)	0.000050 / 0.000054 (-0.000004)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.019032 / 0.037411 (-0.018380)	0.062521 / 0.014526 (0.047995)	0.075668 / 0.176557 (-0.100889)	0.121634 / 0.737135 (-0.615501)	0.075456 / 0.296338 (-0.220882)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.291721 / 0.215209 (0.076512)	2.845445 / 2.077655 (0.767790)	1.450971 / 1.504120 (-0.053149)	1.334586 / 1.541195 (-0.206609)	1.358095 / 1.468490 (-0.110396)	0.729624 / 4.584777 (-3.855153)	2.411504 / 3.745712 (-1.334208)	2.858871 / 5.269862 (-2.410991)	1.893074 / 4.565676 (-2.672603)	0.079068 / 0.424275 (-0.345207)	0.005476 / 0.007607 (-0.002131)	0.329816 / 0.226044 (0.103771)	3.305361 / 2.268929 (1.036432)	1.799924 / 55.444624 (-53.644700)	1.512130 / 6.876477 (-5.364347)	1.635195 / 2.142072 (-0.506877)	0.801486 / 4.805227 (-4.003741)	0.134677 / 6.500664 (-6.365987)	0.042266 / 0.075469 (-0.033203)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.969835 / 1.841788 (-0.871952)	11.421833 / 8.074308 (3.347524)	9.799120 / 10.191392 (-0.392272)	0.144888 / 0.680424 (-0.535536)	0.014191 / 0.534201 (-0.520010)	0.301037 / 0.579283 (-0.278246)	0.263329 / 0.434364 (-0.171034)	0.403013 / 0.540337 (-0.137324)	0.463805 / 1.386936 (-0.923131)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005913 / 0.011353 (-0.005440)	0.003890 / 0.011008 (-0.007118)	0.049995 / 0.038508 (0.011487)	0.032497 / 0.023109 (0.009387)	0.269926 / 0.275898 (-0.005972)	0.295567 / 0.323480 (-0.027913)	0.004365 / 0.007986 (-0.003620)	0.002818 / 0.004328 (-0.001510)	0.049055 / 0.004250 (0.044805)	0.040683 / 0.037052 (0.003630)	0.283043 / 0.258489 (0.024554)	0.321072 / 0.293841 (0.027232)	0.032760 / 0.128546 (-0.095787)	0.012370 / 0.075646 (-0.063277)	0.061574 / 0.419271 (-0.357698)	0.033714 / 0.043533 (-0.009819)	0.276287 / 0.255139 (0.021148)	0.290078 / 0.283200 (0.006878)	0.017250 / 0.141683 (-0.124432)	1.165291 / 1.452155 (-0.286863)	1.213687 / 1.492716 (-0.279029)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.096122 / 0.018006 (0.078115)	0.311954 / 0.000490 (0.311464)	0.000213 / 0.000200 (0.000013)	0.000052 / 0.000054 (-0.000002)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.022142 / 0.037411 (-0.015270)	0.076470 / 0.014526 (0.061945)	0.088340 / 0.176557 (-0.088216)	0.128594 / 0.737135 (-0.608542)	0.089780 / 0.296338 (-0.206558)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.298129 / 0.215209 (0.082920)	2.943735 / 2.077655 (0.866080)	1.574351 / 1.504120 (0.070231)	1.446688 / 1.541195 (-0.094506)	1.477714 / 1.468490 (0.009223)	0.722195 / 4.584777 (-3.862582)	0.967675 / 3.745712 (-2.778037)	2.803346 / 5.269862 (-2.466515)	1.895882 / 4.565676 (-2.669794)	0.079193 / 0.424275 (-0.345082)	0.005250 / 0.007607 (-0.002357)	0.350193 / 0.226044 (0.124149)	3.514562 / 2.268929 (1.245634)	1.962743 / 55.444624 (-53.481881)	1.677308 / 6.876477 (-5.199169)	1.811473 / 2.142072 (-0.330600)	0.796234 / 4.805227 (-4.008993)	0.131810 / 6.500664 (-6.368854)	0.041301 / 0.075469 (-0.034168)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.030700 / 1.841788 (-0.811088)	12.108809 / 8.074308 (4.034501)	10.426112 / 10.191392 (0.234720)	0.139829 / 0.680424 (-0.540595)	0.015133 / 0.534201 (-0.519068)	0.307782 / 0.579283 (-0.271501)	0.130554 / 0.434364 (-0.303810)	0.342728 / 0.540337 (-0.197610)	0.435426 / 1.386936 (-0.951510)

albertvillanova added 9 commits June 25, 2024 10:19

Delete tasks tests

47fd307

Delete task templates tests

38783bd

Delete tasks from folder based builders

8305064

Delete prepare_for_task methods

92d2735

Delete DatasetInfo.task_templates

c138213

Remove tasks subpackage

a0e8f4e

Delete tasks docs

8ae8644

Delete transmit_tasks

92b11a4

Delete prepare_for_task methods from docs

13c868c

albertvillanova added 2 commits June 25, 2024 15:48

Delete tasks from folder based builders docstring

95d206d

Delete deprecated task parameter of load_dataset

304eeb7

albertvillanova added this to the 3.0 milestone Jun 27, 2024

albertvillanova mentioned this pull request Jun 27, 2024

Remove deprecated code #6996

Merged

3 tasks

albertvillanova requested a review from a team June 27, 2024 13:31

lhoestq reviewed Jun 27, 2024

Merge branch 'main' into rm-tasks

879cfce

albertvillanova merged commit 9ddea80 into main Aug 21, 2024
15 checks passed

albertvillanova deleted the rm-tasks branch August 21, 2024 09:01

vidyasiv mentioned this pull request Sep 18, 2024

Dataset v3.0.0 deprecates tasks and cause CI failures huggingface/optimum-habana#1341

Closed

4 tasks

tibor-reiss mentioned this pull request Oct 24, 2024

ModuleNotFoundError: No module named 'datasets.tasks' #7248

Open

bwanglzu mentioned this pull request Oct 30, 2024

No module named 'datasets.tasks' embeddings-benchmark/mteb#1363

Closed

mjun0812 mentioned this pull request Nov 22, 2024

Remove datasets.tasks and add trust_remote_code shunk031/huggingface-datasets_JGLUE#17

Merged


Review comments, in order: lhoestq (Jun 27, 2024), albertvillanova (Jul 1, 2024), lhoestq (Jul 2, 2024, two comments), albertvillanova (Jul 3, 2024), lhoestq (Jul 3, 2024).
