Update on "modify data split to use HF api"
Just found out that the HF datasets library has its own [API](https://huggingface.co/docs/datasets/v2.17.0/en/package_reference/main_classes#datasets.distributed.split_dataset_by_node) for splitting data across DP ranks. Verified that it gives the expected behavior (same data on SP ranks, different data on DP ranks).
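
For reference, a minimal sketch of how this API is used. The `dp_rank` / `dp_degree` names are placeholders for whatever the trainer's parallel setup provides, and the dataset name is illustrative, not identifiers from this repo:

```python
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

# dp_rank / dp_degree are hypothetical names; in practice they come from
# the trainer's data-parallel mesh.
ds = load_dataset("tatsu-lab/alpaca", split="train")
# Passing the same dp_rank on all SP ranks yields identical shards;
# different dp_rank values yield disjoint shards.
ds = split_dataset_by_node(ds, rank=dp_rank, world_size=dp_degree)
```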

Note: This is still a map-style dataset, which has to be loaded into memory. Setting `streaming=True` in [load_dataset](https://huggingface.co/docs/datasets/v2.17.0/en/package_reference/loading_methods#datasets.load_dataset) returns an IterableDataset whose data doesn't have to fit in memory, but the data loading speed is significantly slower.
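
A sketch of the streaming variant (same placeholder names as above); `split_dataset_by_node` also accepts an IterableDataset:

```python
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

# streaming=True returns an IterableDataset consumed lazily, so the data
# never has to fit in memory (at the cost of slower loading).
ds = load_dataset("tatsu-lab/alpaca", split="train", streaming=True)
ds = split_dataset_by_node(ds, rank=dp_rank, world_size=dp_degree)
for sample in ds:  # yields one example dict at a time
    ...
```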


[ghstack-poisoned]
tianyu-l committed Feb 21, 2024
1 parent 869c684 commit 5ebe2e7
Showing 1 changed file with 1 addition and 1 deletion.
`torchtrain/datasets/alpaca.py`:

```diff
@@ -62,7 +62,7 @@ def __iter__(self):
             sample_tokens = self._tokenizer.encode(sample_text, bos=True, eos=True)
             all_tokens.extend(sample_tokens)

-            if len(all_tokens) >= max_buffer_token_len:
+            while len(all_tokens) >= max_buffer_token_len:
                 x = torch.LongTensor(all_tokens[:max_buffer_token_len])
                 # batched_x = x.reshape(self.batch_size, -1)
                 # update tokens to the remaining tokens
```
