
Sample Range for Metadataset #13

Open
lvoegtle opened this issue Sep 24, 2024 · 1 comment
@lvoegtle (Collaborator)

This feature could serve multiple use cases:

  • Limiting the training dataset to a small subset for debugging
  • Post-splitting a dataset into val/train/test
  • Changing sampling weights for a subset

This could be implemented like this:

__module__: megatron.energon
__class__: Metadataset
splits:
  train:
    datasets:
      - sample_range: [0, 1000]
        path: ...
      - ...

This would restrict the dataset to the first 1000 samples. Sample indices would be computed over the concatenation of the webdataset shards (as listed in the split.yaml).
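
For the second use case above (post-splitting a dataset into train/val/test), the same proposed syntax could express sample-level splits. This is only an illustrative sketch, assuming the proposed sample_range field and a hypothetical dataset with 10000 samples; the path and the counts are placeholders:

__module__: megatron.energon
__class__: Metadataset
splits:
  train:
    datasets:
      - sample_range: [0, 8000]
        path: ./dataset
  val:
    datasets:
      - sample_range: [8000, 9000]
        path: ./dataset
  test:
    datasets:
      - sample_range: [9000, 10000]
        path: ./dataset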

This feature would only work at the innermost level, i.e. not for nested metadatasets: sample counting is not well defined there, because it would require concatenating the nested datasets, which is not what happens during training.

Currently, get_val_dataset already has a limit option, but it is different: limit only caps the number of iterations (i.e. batches), which yields a different set of samples than setting a sample range. The range is based on sample index, whereas the limited iterations may interleave different portions of the dataset. For example, limit=10 with a batch size of 32 yields 320 samples taken in iteration order, while sample_range: [0, 320] would always refer to the first 320 samples of the concatenated shards.
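
To make the contrast concrete, here is a rough Python sketch. Only the limit keyword is confirmed by the description above; get_loader, WorkerConfig and the other parameter names are assumptions about the energon API and may not match the actual signatures:

# Hypothetical sketch -- only `limit` is confirmed above; the other
# parameter names are assumptions about the megatron.energon API.
from megatron.energon import WorkerConfig, get_loader, get_val_dataset

# `limit` caps the number of iterations (batches): with batch_size=32 and
# limit=10 this yields 320 samples, drawn in iteration order, which may
# interleave different portions of the dataset.
val_ds = get_val_dataset(
    "metadataset.yaml",
    batch_size=32,
    worker_config=WorkerConfig.default_worker_config(),
    limit=10,
)

# The proposed sample_range would instead select by sample index over the
# concatenated shards: sample_range: [0, 320] in the metadataset yaml would
# always refer to the same first 320 samples, independent of batching.
for batch in get_loader(val_ds):
    pass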

@voegtlel (Collaborator)

There was also a request to allow splitting the dataset into train/val/test by samples (not only by whole shards).
