
Support streaming Beam datasets from HF GCS preprocessed data #5689

Merged: 11 commits into huggingface:main on Apr 12, 2023

Conversation

@albertvillanova (Member) commented Mar 31, 2023

This PR implements streaming for Apache Beam datasets that we have already preprocessed and stored in the HF Google Cloud Storage:

  • natural_questions
  • wiki40b
  • wikipedia

This is done by streaming from the prepared Arrow files in HF Google Cloud Storage.

This will fix their corresponding dataset viewers.

Related to:

CC: @severo

@HuggingFaceDocBuilderDev commented Mar 31, 2023

The documentation is not available anymore as the PR was closed or merged.

@albertvillanova (Member, Author) commented Mar 31, 2023

In [1]: from datasets import load_dataset

In [2]: ds = load_dataset("wikipedia", "20220301.en", split="train", streaming=True); item = next(iter(ds)); item
Out[2]: 
{'id': '12',
 'url': 'https://en.wikipedia.org/wiki/Anarchism',
 'title': 'Anarchism',
 'text': 'Anarchism is a political philosophy and movement that is sceptical of authority and rejects all involuntary, coercive forms of hierarchy. Anarchism calls for the abolition of the state, which it holds to be unnecessary, undesirable, and harmful. As a historically left-wing movement, placed on the farthest left of the political spectrum, it is usually described alongside communalism and libertarian Marxism as the libertarian wing (libertarian socialism) of the socialist movement,...}

@albertvillanova albertvillanova changed the title Support streaming Apache Beam datasets from HF Google Cloud Storage preprocessed data Support streaming Apache Beam datasets from HF GCS preprocessed data Mar 31, 2023
@albertvillanova albertvillanova changed the title Support streaming Apache Beam datasets from HF GCS preprocessed data Support streaming Beam datasets from HF GCS preprocessed data Mar 31, 2023
@severo (Collaborator) commented Mar 31, 2023

I love your example 🏴‍🅰️

@albertvillanova albertvillanova requested a review from lhoestq April 4, 2023 06:24
@lhoestq (Member) left a comment

Amazing, thanks!!

You could also add a simple integration test in test_hf_gcp.py for wikipedia to make sure it keeps working in the long run.

src/datasets/builder.py (review thread, resolved)
src/datasets/builder.py (review thread, outdated, resolved)
    return ExamplesIterable(self._generate_examples_from_hf_gcs, {"split": split})

def _generate_examples_from_hf_gcs(self, split):
    remote_prepared_filename = f"{self._remote_cache_dir_from_hf_gcs}/{self.name}-{split}.arrow"
@lhoestq (Member) commented Apr 4, 2023

(nit) Beam builders may create sharded Arrow files now.

It's the case for none of the datasets on GCP, but if we regenerate one it may end up sharded. You can check the dataset's info.shard_lengths to know whether it's sharded and, if so, how many shards there are.

    if self.info.splits[split].shard_lengths:
        num_shards = len(self.info.splits[split].shard_lengths)
        urls = [
            f"{self._remote_cache_dir_from_hf_gcs}/{self.name}-{split}-{shard_id:05d}-of-{num_shards:05d}.arrow"
            for shard_id in range(num_shards)
        ]
    else:
        urls = [f"{self._remote_cache_dir_from_hf_gcs}/{self.name}-{split}.arrow"]

edit: fixed self.info.splits[split].shard_lengths
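The shard-URL logic suggested above can be exercised standalone; a minimal sketch where the names (build_shard_urls, remote_cache_dir, builder_name) are placeholders for illustration, not the library's API:

```python
def build_shard_urls(remote_cache_dir, builder_name, split_name, shard_lengths):
    """Return the Arrow file URL(s) for a split, sharded or not.

    shard_lengths is the per-shard row counts (None/empty if unsharded),
    mirroring how the suggestion reads self.info.splits[split].shard_lengths.
    """
    if shard_lengths:
        num_shards = len(shard_lengths)
        return [
            f"{remote_cache_dir}/{builder_name}-{split_name}-{shard_id:05d}-of-{num_shards:05d}.arrow"
            for shard_id in range(num_shards)
        ]
    return [f"{remote_cache_dir}/{builder_name}-{split_name}.arrow"]

sharded = build_shard_urls("gcs://cache", "wikipedia", "train", [100, 100, 50])
single = build_shard_urls("gcs://cache", "wikipedia", "train", None)
```

The `:05d` format spec zero-pads shard ids to five digits, matching the `00000-of-00003` naming convention in the suggested snippet.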

@albertvillanova (Member, Author):

I see... Thanks.
Let me refactor the code.

@albertvillanova (Member, Author):

As the attribute shard_lengths belongs to SplitInfo (and not DatasetInfo), I have refactored the method so that split is now of type SplitInfo instead of str.
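To illustrate the distinction, a simplified, hypothetical mock of the two classes (illustrative stand-ins, not the real datasets classes): shard_lengths lives on the per-split info object, while the dataset-level info only maps split names to those objects.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class SplitInfo:
    name: str
    shard_lengths: Optional[List[int]] = None  # per-shard row counts, if sharded

@dataclass
class DatasetInfo:
    splits: Dict[str, SplitInfo] = field(default_factory=dict)

info = DatasetInfo(splits={"train": SplitInfo("train", shard_lengths=[100, 50])})
split_info = info.splits["train"]  # pass the SplitInfo, not the split name string
num_shards = len(split_info.shard_lengths) if split_info.shard_lengths else 1
```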

@lhoestq (Member) left a comment

LGTM ! :)

@albertvillanova albertvillanova merged commit ce06edf into huggingface:main Apr 12, 2023
@albertvillanova albertvillanova deleted the stream-beam branch April 12, 2023 05:50
@github-actions commented with benchmark results:

PyArrow==8.0.0


Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.007859 / 0.011353 (-0.003493) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.005129 / 0.011008 (-0.005879) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.098070 / 0.038508 (0.059562) |
| read_batch_unformated after write_array2d | 0.036500 / 0.023109 (0.013391) |
| read_batch_unformated after write_flattened_sequence | 0.311575 / 0.275898 (0.035677) |
| read_batch_unformated after write_nested_sequence | 0.338351 / 0.323480 (0.014872) |
| read_col_formatted_as_numpy after write_array2d | 0.005962 / 0.007986 (-0.002024) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004060 / 0.004328 (-0.000268) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.072970 / 0.004250 (0.068719) |
| read_col_unformated after write_array2d | 0.049289 / 0.037052 (0.012237) |
| read_col_unformated after write_flattened_sequence | 0.310303 / 0.258489 (0.051814) |
| read_col_unformated after write_nested_sequence | 0.347449 / 0.293841 (0.053608) |
| read_formatted_as_numpy after write_array2d | 0.046912 / 0.128546 (-0.081634) |
| read_formatted_as_numpy after write_flattened_sequence | 0.011952 / 0.075646 (-0.063694) |
| read_formatted_as_numpy after write_nested_sequence | 0.333600 / 0.419271 (-0.085671) |
| read_unformated after write_array2d | 0.052700 / 0.043533 (0.009167) |
| read_unformated after write_flattened_sequence | 0.325486 / 0.255139 (0.070347) |
| read_unformated after write_nested_sequence | 0.326920 / 0.283200 (0.043720) |
| write_array2d | 0.107683 / 0.141683 (-0.034000) |
| write_flattened_sequence | 1.416679 / 1.452155 (-0.035476) |
| write_nested_sequence | 1.502418 / 1.492716 (0.009702) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.216520 / 0.018006 (0.198514) |
| get_batch_of_1024_rows | 0.448450 / 0.000490 (0.447960) |
| get_first_row | 0.004213 / 0.000200 (0.004013) |
| get_last_row | 0.000082 / 0.000054 (0.000028) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.027081 / 0.037411 (-0.010331) |
| shard | 0.110989 / 0.014526 (0.096463) |
| shuffle | 0.116087 / 0.176557 (-0.060470) |
| sort | 0.173771 / 0.737135 (-0.563364) |
| train_test_split | 0.121240 / 0.296338 (-0.175099) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.399938 / 0.215209 (0.184729) |
| read 50000 | 4.017665 / 2.077655 (1.940010) |
| read_batch 50000 10 | 1.782327 / 1.504120 (0.278207) |
| read_batch 50000 100 | 1.612955 / 1.541195 (0.071761) |
| read_batch 50000 1000 | 1.698839 / 1.468490 (0.230349) |
| read_formatted numpy 5000 | 0.706702 / 4.584777 (-3.878075) |
| read_formatted pandas 5000 | 4.533425 / 3.745712 (0.787713) |
| read_formatted tensorflow 5000 | 2.102611 / 5.269862 (-3.167250) |
| read_formatted torch 5000 | 1.461429 / 4.565676 (-3.104248) |
| read_formatted_batch numpy 5000 10 | 0.085719 / 0.424275 (-0.338556) |
| read_formatted_batch numpy 5000 1000 | 0.012104 / 0.007607 (0.004497) |
| shuffled read 5000 | 0.507397 / 0.226044 (0.281352) |
| shuffled read 50000 | 5.061572 / 2.268929 (2.792643) |
| shuffled read_batch 50000 10 | 2.272106 / 55.444624 (-53.172518) |
| shuffled read_batch 50000 100 | 1.935575 / 6.876477 (-4.940901) |
| shuffled read_batch 50000 1000 | 2.102541 / 2.142072 (-0.039532) |
| shuffled read_formatted numpy 5000 | 0.838395 / 4.805227 (-3.966832) |
| shuffled read_formatted_batch numpy 5000 10 | 0.168573 / 6.500664 (-6.332091) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.064234 / 0.075469 (-0.011235) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.190077 / 1.841788 (-0.651710) |
| map fast-tokenizer batched | 15.765587 / 8.074308 (7.691279) |
| map identity | 14.694626 / 10.191392 (4.503234) |
| map identity batched | 0.142912 / 0.680424 (-0.537512) |
| map no-op batched | 0.017669 / 0.534201 (-0.516532) |
| map no-op batched numpy | 0.421502 / 0.579283 (-0.157781) |
| map no-op batched pandas | 0.452732 / 0.434364 (0.018368) |
| map no-op batched pytorch | 0.497480 / 0.540337 (-0.042857) |
| map no-op batched tensorflow | 0.586310 / 1.386936 (-0.800626) |
PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.007629 / 0.011353 (-0.003724) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.005330 / 0.011008 (-0.005679) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.076366 / 0.038508 (0.037858) |
| read_batch_unformated after write_array2d | 0.034703 / 0.023109 (0.011593) |
| read_batch_unformated after write_flattened_sequence | 0.356300 / 0.275898 (0.080402) |
| read_batch_unformated after write_nested_sequence | 0.392909 / 0.323480 (0.069429) |
| read_col_formatted_as_numpy after write_array2d | 0.005959 / 0.007986 (-0.002026) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004140 / 0.004328 (-0.000188) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.075289 / 0.004250 (0.071039) |
| read_col_unformated after write_array2d | 0.047880 / 0.037052 (0.010828) |
| read_col_unformated after write_flattened_sequence | 0.357289 / 0.258489 (0.098800) |
| read_col_unformated after write_nested_sequence | 0.404554 / 0.293841 (0.110714) |
| read_formatted_as_numpy after write_array2d | 0.037182 / 0.128546 (-0.091365) |
| read_formatted_as_numpy after write_flattened_sequence | 0.012266 / 0.075646 (-0.063380) |
| read_formatted_as_numpy after write_nested_sequence | 0.088554 / 0.419271 (-0.330718) |
| read_unformated after write_array2d | 0.049698 / 0.043533 (0.006165) |
| read_unformated after write_flattened_sequence | 0.353453 / 0.255139 (0.098314) |
| read_unformated after write_nested_sequence | 0.373252 / 0.283200 (0.090052) |
| write_array2d | 0.101892 / 0.141683 (-0.039791) |
| write_flattened_sequence | 1.481534 / 1.452155 (0.029380) |
| write_nested_sequence | 1.553818 / 1.492716 (0.061102) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.229891 / 0.018006 (0.211884) |
| get_batch_of_1024_rows | 0.452444 / 0.000490 (0.451954) |
| get_first_row | 0.000434 / 0.000200 (0.000234) |
| get_last_row | 0.000058 / 0.000054 (0.000004) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.030170 / 0.037411 (-0.007241) |
| shard | 0.115097 / 0.014526 (0.100571) |
| shuffle | 0.122094 / 0.176557 (-0.054463) |
| sort | 0.171352 / 0.737135 (-0.565784) |
| train_test_split | 0.128441 / 0.296338 (-0.167898) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.428347 / 0.215209 (0.213138) |
| read 50000 | 4.266243 / 2.077655 (2.188588) |
| read_batch 50000 10 | 2.148327 / 1.504120 (0.644207) |
| read_batch 50000 100 | 1.874141 / 1.541195 (0.332946) |
| read_batch 50000 1000 | 1.968737 / 1.468490 (0.500246) |
| read_formatted numpy 5000 | 0.715320 / 4.584777 (-3.869457) |
| read_formatted pandas 5000 | 4.166097 / 3.745712 (0.420384) |
| read_formatted tensorflow 5000 | 2.169550 / 5.269862 (-3.100312) |
| read_formatted torch 5000 | 1.377441 / 4.565676 (-3.188236) |
| read_formatted_batch numpy 5000 10 | 0.086376 / 0.424275 (-0.337899) |
| read_formatted_batch numpy 5000 1000 | 0.012018 / 0.007607 (0.004411) |
| shuffled read 5000 | 0.517433 / 0.226044 (0.291388) |
| shuffled read 50000 | 5.167327 / 2.268929 (2.898398) |
| shuffled read_batch 50000 10 | 2.545822 / 55.444624 (-52.898803) |
| shuffled read_batch 50000 100 | 2.241726 / 6.876477 (-4.634751) |
| shuffled read_batch 50000 1000 | 2.327220 / 2.142072 (0.185147) |
| shuffled read_formatted numpy 5000 | 0.841618 / 4.805227 (-3.963609) |
| shuffled read_formatted_batch numpy 5000 10 | 0.169473 / 6.500664 (-6.331191) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.065505 / 0.075469 (-0.009964) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.270476 / 1.841788 (-0.571312) |
| map fast-tokenizer batched | 17.049885 / 8.074308 (8.975577) |
| map identity | 14.847615 / 10.191392 (4.656223) |
| map identity batched | 0.168671 / 0.680424 (-0.511753) |
| map no-op batched | 0.017564 / 0.534201 (-0.516637) |
| map no-op batched numpy | 0.424780 / 0.579283 (-0.154503) |
| map no-op batched pandas | 0.517392 / 0.434364 (0.083028) |
| map no-op batched pytorch | 0.561197 / 0.540337 (0.020859) |
| map no-op batched tensorflow | 0.697792 / 1.386936 (-0.689144) |

4 participants