
Sample Range for Metadataset #13

Open
lvoegtle opened this issue Sep 24, 2024 · 1 comment
@lvoegtle (Collaborator)

This feature could serve multiple use cases:

  • Limiting the training dataset to a small subset for debugging
  • Post-splitting a dataset into val/train/test
  • Changing sampling weights for a subset

This could be implemented like this:

__module__: megatron.energon
__class__: Metadataset
splits:
  train:
    datasets:
      - sample_range: [0, 1000]
        path: ...
      - ...

This would restrict the dataset to the first 1000 samples. Sample indices would be computed over the concatenation of the webdataset shards (as listed in the split.yaml).
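
For the second use case above (post-splitting a dataset into train/val/test), the same proposed syntax could express sample-level splits. This is only an illustrative sketch, assuming the proposed sample_range field and a hypothetical dataset with 10000 samples; the path and the counts are placeholders:

__module__: megatron.energon
__class__: Metadataset
splits:
  train:
    datasets:
      - sample_range: [0, 8000]
        path: ./dataset
  val:
    datasets:
      - sample_range: [8000, 9000]
        path: ./dataset
  test:
    datasets:
      - sample_range: [9000, 10000]
        path: ./dataset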

This feature would only work at the innermost level, i.e. not for nested metadatasets: sample counting is not well defined there, because it would require concatenating the nested datasets, which is not what happens during training.

Currently, get_val_dataset already has a limit option, but it is different: limit only caps the number of iterations (i.e. batches), which yields a different set of samples than setting a sample range. The range is based on sample index, whereas the limited iterations may interleave different portions of the dataset. For example, limit=10 with a batch size of 32 yields 320 samples taken in iteration order, while sample_range: [0, 320] would always refer to the first 320 samples of the concatenated shards.
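
To make the contrast concrete, here is a rough Python sketch. Only the limit keyword is confirmed by the description above; get_loader, WorkerConfig and the other parameter names are assumptions about the energon API and may not match the actual signatures:

# Hypothetical sketch -- only `limit` is confirmed above; the other
# parameter names are assumptions about the megatron.energon API.
from megatron.energon import WorkerConfig, get_loader, get_val_dataset

# `limit` caps the number of iterations (batches): with batch_size=32 and
# limit=10 this yields 320 samples, drawn in iteration order, which may
# interleave different portions of the dataset.
val_ds = get_val_dataset(
    "metadataset.yaml",
    batch_size=32,
    worker_config=WorkerConfig.default_worker_config(),
    limit=10,
)

# The proposed sample_range would instead select by sample index over the
# concatenated shards: sample_range: [0, 320] in the metadataset yaml would
# always refer to the same first 320 samples, independent of batching.
for batch in get_loader(val_ds):
    pass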

@voegtlel (Collaborator)

There was also a request to allow splitting the dataset into train/val/test by samples (not only by whole shards).
