Update on "modify data split to use HF api"
Just found out that the HF datasets library has its own [API](https://huggingface.co/docs/datasets/v2.17.0/en/package_reference/main_classes#datasets.distributed.split_dataset_by_node) for splitting data across DP ranks. Verified that it gives the expected behavior (same data on SP ranks, different data on DP ranks).
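
For reference, a minimal sketch of how this API is used. The `dp_rank` / `dp_degree` names are placeholders for whatever the trainer's parallel setup provides, and the dataset name is illustrative, not identifiers from this repo:

```python
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

# dp_rank / dp_degree are hypothetical names; in practice they come from
# the trainer's data-parallel mesh.
ds = load_dataset("tatsu-lab/alpaca", split="train")
# Passing the same dp_rank on all SP ranks yields identical shards;
# different dp_rank values yield disjoint shards.
ds = split_dataset_by_node(ds, rank=dp_rank, world_size=dp_degree)
```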

Note: This is still a map-style dataset, which has to be loaded into memory. Setting `streaming=True` in [load_dataset](https://huggingface.co/docs/datasets/v2.17.0/en/package_reference/loading_methods#datasets.load_dataset) returns an IterableDataset whose data doesn't have to fit in memory, but the data loading speed is significantly slower.
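
A sketch of the streaming variant (same placeholder names as above); `split_dataset_by_node` also accepts an IterableDataset:

```python
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

# streaming=True returns an IterableDataset consumed lazily, so the data
# never has to fit in memory (at the cost of slower loading).
ds = load_dataset("tatsu-lab/alpaca", split="train", streaming=True)
ds = split_dataset_by_node(ds, rank=dp_rank, world_size=dp_degree)
for sample in ds:  # yields one example dict at a time
    ...
```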


[ghstack-poisoned]
tianyu-l committed Feb 21, 2024
1 parent 869c684 commit 5ebe2e7
Showing 1 changed file with 1 addition and 1 deletion.
`torchtrain/datasets/alpaca.py`:

```diff
@@ -62,7 +62,7 @@ def __iter__(self):
             sample_tokens = self._tokenizer.encode(sample_text, bos=True, eos=True)
             all_tokens.extend(sample_tokens)

-            if len(all_tokens) >= max_buffer_token_len:
+            while len(all_tokens) >= max_buffer_token_len:
                 x = torch.LongTensor(all_tokens[:max_buffer_token_len])
                 # batched_x = x.reshape(self.batch_size, -1)
                 # update tokens to the remaining tokens
```
