Support streaming Beam datasets from HF GCS preprocessed data #5689
Conversation
The documentation is not available anymore as the PR was closed or merged.
In [1]: from datasets import load_dataset
In [2]: ds = load_dataset("wikipedia", "20220301.en", split="train", streaming=True); item = next(iter(ds)); item
Out[2]:
{'id': '12',
'url': 'https://en.wikipedia.org/wiki/Anarchism',
'title': 'Anarchism',
 'text': 'Anarchism is a political philosophy and movement that is sceptical of authority and rejects all involuntary, coercive forms of hierarchy. Anarchism calls for the abolition of the state, which it holds to be unnecessary, undesirable, and harmful. As a historically left-wing movement, placed on the farthest left of the political spectrum, it is usually described alongside communalism and libertarian Marxism as the libertarian wing (libertarian socialism) of the socialist movement,...}
I love your example 🏴
Amazing, thanks!!
You could also have a simple integration test in test_hf_gcp.py for wikipedia to make sure it keeps working in the long run
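A minimal sketch of such a test (illustrative only; the test name and the actual harness in test_hf_gcp.py are assumptions):

from datasets import load_dataset


def test_stream_wikipedia_from_hf_gcs():
    # Stream the preprocessed Beam dataset from HF GCS instead of building it locally.
    ds = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)
    item = next(iter(ds))
    # The first streamed example should expose the expected columns.
    assert set(item.keys()) == {"id", "url", "title", "text"}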
src/datasets/builder.py
Outdated
return ExamplesIterable(self._generate_examples_from_hf_gcs, {"split": split})

def _generate_examples_from_hf_gcs(self, split):
    remote_prepared_filename = f"{self._remote_cache_dir_from_hf_gcs}/{self.name}-{split}.arrow"
(nit) Beam builders may create sharded Arrow files now.
It's not the case for any of the datasets on GCP yet, but if we regenerate one it may end up sharded.
You can check the dataset info.shard_lengths to know whether it's sharded and how many shards there are:
if self.info.splits[split].shard_lengths:
num_shards = len(self.info.splits[split].shard_lengths)
urls = [
f"{self._remote_cache_dir_from_hf_gcs}/{self.name}-{split}-{shard_id:05d}-of-{num_shards:05d}.arrow"
for shard_id in range(num_shards)
]
else:
urls = [f"{self._remote_cache_dir_from_hf_gcs}/{self.name}-{split}.arrow"]
edit: fixed self.info.splits[split].shard_lengths
I see... Thanks.
Let me refactor the code.
As the attribute shard_lengths belongs to SplitInfo (and not DatasetInfo), I have refactored the method so that split is now of type SplitInfo instead of str.
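A rough sketch of what the refactored method could look like (illustrative; the exact code is in the PR diff, and the SplitInfo import path is an assumption):

from datasets.splits import SplitInfo

def _generate_examples_from_hf_gcs(self, split: SplitInfo):
    # split is now a SplitInfo, so shard metadata comes directly from it
    # instead of being looked up by name in self.info.splits.
    if split.shard_lengths:
        num_shards = len(split.shard_lengths)
        urls = [
            f"{self._remote_cache_dir_from_hf_gcs}/{self.name}-{split.name}-{shard_id:05d}-of-{num_shards:05d}.arrow"
            for shard_id in range(num_shards)
        ]
    else:
        urls = [f"{self._remote_cache_dir_from_hf_gcs}/{self.name}-{split.name}.arrow"]
    ...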
LGTM ! :)
This PR implements streaming Apache Beam datasets that are already preprocessed by us and stored in the HF Google Cloud Storage.
This is done by streaming from the prepared Arrow files in HF Google Cloud Storage.
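As a rough illustration of the idea (not the PR's actual code; the function name and the use of pyarrow's IPC stream reader are assumptions), yielding examples from a prepared Arrow file could look like this:

import pyarrow as pa

def iter_examples_from_arrow(filepath):
    # Stream record batches from a prepared Arrow file and yield rows as
    # plain dicts, without materializing the whole table in memory.
    with pa.OSFile(filepath, "rb") as f:
        reader = pa.ipc.open_stream(f)
        for batch in reader:
            for row in batch.to_pylist():
                yield row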
This will fix their corresponding dataset viewers.
Related to:
CC: @severo