
Support streaming Beam datasets from HF GCS preprocessed data #5689

Merged: 11 commits into huggingface:main on Apr 12, 2023

Conversation

@albertvillanova (Member) commented Mar 31, 2023

This PR implements streaming for Apache Beam datasets that we have already preprocessed and stored in the HF Google Cloud Storage:

  • natural_questions
  • wiki40b
  • wikipedia

This is done by streaming from the prepared Arrow files in HF Google Cloud Storage.

This will fix their corresponding dataset viewers.

Related to:

CC: @severo

@HuggingFaceDocBuilderDev commented Mar 31, 2023

The documentation is not available anymore as the PR was closed or merged.

@albertvillanova (Member, Author) commented Mar 31, 2023

In [1]: from datasets import load_dataset

In [2]: ds = load_dataset("wikipedia", "20220301.en", split="train", streaming=True); item = next(iter(ds)); item
Out[2]: 
{'id': '12',
 'url': 'https://en.wikipedia.org/wiki/Anarchism',
 'title': 'Anarchism',
 'text': 'Anarchism is a political philosophy and movement that is sceptical of authority and rejects all involuntary, coercive forms of hierarchy. Anarchism calls for the abolition of the state, which it holds to be unnecessary, undesirable, and harmful. As a historically left-wing movement, placed on the farthest left of the political spectrum, it is usually described alongside communalism and libertarian Marxism as the libertarian wing (libertarian socialism) of the socialist movement,...}

@albertvillanova albertvillanova changed the title Support streaming Apache Beam datasets from HF Google Cloud Storage preprocessed data Support streaming Apache Beam datasets from HF GCS preprocessed data Mar 31, 2023
@albertvillanova albertvillanova changed the title Support streaming Apache Beam datasets from HF GCS preprocessed data Support streaming Beam datasets from HF GCS preprocessed data Mar 31, 2023
@severo (Collaborator) commented Mar 31, 2023

I love your example 🏴‍🅰️

@albertvillanova albertvillanova requested a review from lhoestq April 4, 2023 06:24
@lhoestq (Member) left a comment

Amazing, thanks!!

You could also add a simple integration test in test_hf_gcp.py for wikipedia to make sure it keeps working in the long run.

src/datasets/builder.py (review thread, resolved)
src/datasets/builder.py (review thread, outdated, resolved)
    return ExamplesIterable(self._generate_examples_from_hf_gcs, {"split": split})

def _generate_examples_from_hf_gcs(self, split):
    remote_prepared_filename = f"{self._remote_cache_dir_from_hf_gcs}/{self.name}-{split}.arrow"
@lhoestq (Member) commented Apr 4, 2023

(nit) Beam builders may create sharded Arrow files now.

It's the case for none of the datasets on GCP, but if we regenerate one it may end up sharded. You can check the dataset's info.shard_lengths to know whether it's sharded and, if so, how many shards there are.

    if self.info.splits[split].shard_lengths:
        num_shards = len(self.info.splits[split].shard_lengths)
        urls = [
            f"{self._remote_cache_dir_from_hf_gcs}/{self.name}-{split}-{shard_id:05d}-of-{num_shards:05d}.arrow"
            for shard_id in range(num_shards)
        ]
    else:
        urls = [f"{self._remote_cache_dir_from_hf_gcs}/{self.name}-{split}.arrow"]

edit: fixed self.info.splits[split].shard_lengths
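The shard-URL logic suggested above can be exercised standalone; a minimal sketch where the names (build_shard_urls, remote_cache_dir, builder_name) are placeholders for illustration, not the library's API:

```python
def build_shard_urls(remote_cache_dir, builder_name, split_name, shard_lengths):
    """Return the Arrow file URL(s) for a split, sharded or not.

    shard_lengths is the per-shard row counts (None/empty if unsharded),
    mirroring how the suggestion reads self.info.splits[split].shard_lengths.
    """
    if shard_lengths:
        num_shards = len(shard_lengths)
        return [
            f"{remote_cache_dir}/{builder_name}-{split_name}-{shard_id:05d}-of-{num_shards:05d}.arrow"
            for shard_id in range(num_shards)
        ]
    return [f"{remote_cache_dir}/{builder_name}-{split_name}.arrow"]

sharded = build_shard_urls("gcs://cache", "wikipedia", "train", [100, 100, 50])
single = build_shard_urls("gcs://cache", "wikipedia", "train", None)
```

The `:05d` format spec zero-pads shard ids to five digits, matching the `00000-of-00003` naming convention in the suggested snippet.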

@albertvillanova (Member, Author):

I see... Thanks.
Let me refactor the code.

@albertvillanova (Member, Author):

As the attribute shard_lengths belongs to SplitInfo (and not DatasetInfo), I have refactored the method so that split is now of type SplitInfo instead of str.
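To illustrate the distinction, a simplified, hypothetical mock of the two classes (illustrative stand-ins, not the real datasets classes): shard_lengths lives on the per-split info object, while the dataset-level info only maps split names to those objects.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class SplitInfo:
    name: str
    shard_lengths: Optional[List[int]] = None  # per-shard row counts, if sharded

@dataclass
class DatasetInfo:
    splits: Dict[str, SplitInfo] = field(default_factory=dict)

info = DatasetInfo(splits={"train": SplitInfo("train", shard_lengths=[100, 50])})
split_info = info.splits["train"]  # pass the SplitInfo, not the split name string
num_shards = len(split_info.shard_lengths) if split_info.shard_lengths else 1
```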

@lhoestq (Member) left a comment

LGTM ! :)

@albertvillanova albertvillanova merged commit ce06edf into huggingface:main Apr 12, 2023
@albertvillanova albertvillanova deleted the stream-beam branch April 12, 2023 05:50
@github-actions commented with benchmark results:

PyArrow==8.0.0


Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.007859 / 0.011353 (-0.003493) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.005129 / 0.011008 (-0.005879) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.098070 / 0.038508 (0.059562) |
| read_batch_unformated after write_array2d | 0.036500 / 0.023109 (0.013391) |
| read_batch_unformated after write_flattened_sequence | 0.311575 / 0.275898 (0.035677) |
| read_batch_unformated after write_nested_sequence | 0.338351 / 0.323480 (0.014872) |
| read_col_formatted_as_numpy after write_array2d | 0.005962 / 0.007986 (-0.002024) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004060 / 0.004328 (-0.000268) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.072970 / 0.004250 (0.068719) |
| read_col_unformated after write_array2d | 0.049289 / 0.037052 (0.012237) |
| read_col_unformated after write_flattened_sequence | 0.310303 / 0.258489 (0.051814) |
| read_col_unformated after write_nested_sequence | 0.347449 / 0.293841 (0.053608) |
| read_formatted_as_numpy after write_array2d | 0.046912 / 0.128546 (-0.081634) |
| read_formatted_as_numpy after write_flattened_sequence | 0.011952 / 0.075646 (-0.063694) |
| read_formatted_as_numpy after write_nested_sequence | 0.333600 / 0.419271 (-0.085671) |
| read_unformated after write_array2d | 0.052700 / 0.043533 (0.009167) |
| read_unformated after write_flattened_sequence | 0.325486 / 0.255139 (0.070347) |
| read_unformated after write_nested_sequence | 0.326920 / 0.283200 (0.043720) |
| write_array2d | 0.107683 / 0.141683 (-0.034000) |
| write_flattened_sequence | 1.416679 / 1.452155 (-0.035476) |
| write_nested_sequence | 1.502418 / 1.492716 (0.009702) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.216520 / 0.018006 (0.198514) |
| get_batch_of_1024_rows | 0.448450 / 0.000490 (0.447960) |
| get_first_row | 0.004213 / 0.000200 (0.004013) |
| get_last_row | 0.000082 / 0.000054 (0.000028) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.027081 / 0.037411 (-0.010331) |
| shard | 0.110989 / 0.014526 (0.096463) |
| shuffle | 0.116087 / 0.176557 (-0.060470) |
| sort | 0.173771 / 0.737135 (-0.563364) |
| train_test_split | 0.121240 / 0.296338 (-0.175099) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.399938 / 0.215209 (0.184729) |
| read 50000 | 4.017665 / 2.077655 (1.940010) |
| read_batch 50000 10 | 1.782327 / 1.504120 (0.278207) |
| read_batch 50000 100 | 1.612955 / 1.541195 (0.071761) |
| read_batch 50000 1000 | 1.698839 / 1.468490 (0.230349) |
| read_formatted numpy 5000 | 0.706702 / 4.584777 (-3.878075) |
| read_formatted pandas 5000 | 4.533425 / 3.745712 (0.787713) |
| read_formatted tensorflow 5000 | 2.102611 / 5.269862 (-3.167250) |
| read_formatted torch 5000 | 1.461429 / 4.565676 (-3.104248) |
| read_formatted_batch numpy 5000 10 | 0.085719 / 0.424275 (-0.338556) |
| read_formatted_batch numpy 5000 1000 | 0.012104 / 0.007607 (0.004497) |
| shuffled read 5000 | 0.507397 / 0.226044 (0.281352) |
| shuffled read 50000 | 5.061572 / 2.268929 (2.792643) |
| shuffled read_batch 50000 10 | 2.272106 / 55.444624 (-53.172518) |
| shuffled read_batch 50000 100 | 1.935575 / 6.876477 (-4.940901) |
| shuffled read_batch 50000 1000 | 2.102541 / 2.142072 (-0.039532) |
| shuffled read_formatted numpy 5000 | 0.838395 / 4.805227 (-3.966832) |
| shuffled read_formatted_batch numpy 5000 10 | 0.168573 / 6.500664 (-6.332091) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.064234 / 0.075469 (-0.011235) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.190077 / 1.841788 (-0.651710) |
| map fast-tokenizer batched | 15.765587 / 8.074308 (7.691279) |
| map identity | 14.694626 / 10.191392 (4.503234) |
| map identity batched | 0.142912 / 0.680424 (-0.537512) |
| map no-op batched | 0.017669 / 0.534201 (-0.516532) |
| map no-op batched numpy | 0.421502 / 0.579283 (-0.157781) |
| map no-op batched pandas | 0.452732 / 0.434364 (0.018368) |
| map no-op batched pytorch | 0.497480 / 0.540337 (-0.042857) |
| map no-op batched tensorflow | 0.586310 / 1.386936 (-0.800626) |
PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.007629 / 0.011353 (-0.003724) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.005330 / 0.011008 (-0.005679) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.076366 / 0.038508 (0.037858) |
| read_batch_unformated after write_array2d | 0.034703 / 0.023109 (0.011593) |
| read_batch_unformated after write_flattened_sequence | 0.356300 / 0.275898 (0.080402) |
| read_batch_unformated after write_nested_sequence | 0.392909 / 0.323480 (0.069429) |
| read_col_formatted_as_numpy after write_array2d | 0.005959 / 0.007986 (-0.002026) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004140 / 0.004328 (-0.000188) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.075289 / 0.004250 (0.071039) |
| read_col_unformated after write_array2d | 0.047880 / 0.037052 (0.010828) |
| read_col_unformated after write_flattened_sequence | 0.357289 / 0.258489 (0.098800) |
| read_col_unformated after write_nested_sequence | 0.404554 / 0.293841 (0.110714) |
| read_formatted_as_numpy after write_array2d | 0.037182 / 0.128546 (-0.091365) |
| read_formatted_as_numpy after write_flattened_sequence | 0.012266 / 0.075646 (-0.063380) |
| read_formatted_as_numpy after write_nested_sequence | 0.088554 / 0.419271 (-0.330718) |
| read_unformated after write_array2d | 0.049698 / 0.043533 (0.006165) |
| read_unformated after write_flattened_sequence | 0.353453 / 0.255139 (0.098314) |
| read_unformated after write_nested_sequence | 0.373252 / 0.283200 (0.090052) |
| write_array2d | 0.101892 / 0.141683 (-0.039791) |
| write_flattened_sequence | 1.481534 / 1.452155 (0.029380) |
| write_nested_sequence | 1.553818 / 1.492716 (0.061102) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.229891 / 0.018006 (0.211884) |
| get_batch_of_1024_rows | 0.452444 / 0.000490 (0.451954) |
| get_first_row | 0.000434 / 0.000200 (0.000234) |
| get_last_row | 0.000058 / 0.000054 (0.000004) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.030170 / 0.037411 (-0.007241) |
| shard | 0.115097 / 0.014526 (0.100571) |
| shuffle | 0.122094 / 0.176557 (-0.054463) |
| sort | 0.171352 / 0.737135 (-0.565784) |
| train_test_split | 0.128441 / 0.296338 (-0.167898) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.428347 / 0.215209 (0.213138) |
| read 50000 | 4.266243 / 2.077655 (2.188588) |
| read_batch 50000 10 | 2.148327 / 1.504120 (0.644207) |
| read_batch 50000 100 | 1.874141 / 1.541195 (0.332946) |
| read_batch 50000 1000 | 1.968737 / 1.468490 (0.500246) |
| read_formatted numpy 5000 | 0.715320 / 4.584777 (-3.869457) |
| read_formatted pandas 5000 | 4.166097 / 3.745712 (0.420384) |
| read_formatted tensorflow 5000 | 2.169550 / 5.269862 (-3.100312) |
| read_formatted torch 5000 | 1.377441 / 4.565676 (-3.188236) |
| read_formatted_batch numpy 5000 10 | 0.086376 / 0.424275 (-0.337899) |
| read_formatted_batch numpy 5000 1000 | 0.012018 / 0.007607 (0.004411) |
| shuffled read 5000 | 0.517433 / 0.226044 (0.291388) |
| shuffled read 50000 | 5.167327 / 2.268929 (2.898398) |
| shuffled read_batch 50000 10 | 2.545822 / 55.444624 (-52.898803) |
| shuffled read_batch 50000 100 | 2.241726 / 6.876477 (-4.634751) |
| shuffled read_batch 50000 1000 | 2.327220 / 2.142072 (0.185147) |
| shuffled read_formatted numpy 5000 | 0.841618 / 4.805227 (-3.963609) |
| shuffled read_formatted_batch numpy 5000 10 | 0.169473 / 6.500664 (-6.331191) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.065505 / 0.075469 (-0.009964) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.270476 / 1.841788 (-0.571312) |
| map fast-tokenizer batched | 17.049885 / 8.074308 (8.975577) |
| map identity | 14.847615 / 10.191392 (4.656223) |
| map identity batched | 0.168671 / 0.680424 (-0.511753) |
| map no-op batched | 0.017564 / 0.534201 (-0.516637) |
| map no-op batched numpy | 0.424780 / 0.579283 (-0.154503) |
| map no-op batched pandas | 0.517392 / 0.434364 (0.083028) |
| map no-op batched pytorch | 0.561197 / 0.540337 (0.020859) |
| map no-op batched tensorflow | 0.697792 / 1.386936 (-0.689144) |

4 participants