Add FMS datasets #1

daviswer · 2024-05-21T20:31:48Z

This PR introduces an experimental PyTorch-native dataloader from IBM that is distributed, stateful, checkpointable, composable and rescalable. It is intended for use in large-scale model pretraining, particularly in research settings where rapid iteration between datasets may be required. It automatically and invisibly handles data sharding, shuffling, subdataset weighting, checkpoint saving and loading, and more, with minimal overhead and high throughput.

- Add the experimental dataset source file
- Add an experimental dataloader builder, hooked into the torchtitan config
- Update the torchtitan config with additional dataset argument fields
- Update the train script to build the experimental dataloader instead of the Hugging Face one, depending on config flags
- Replace the existing C4-mini example dataset with one that matches the format the experimental dataloader expects
- TODO: port over the unit tests as well
- TODO: add preprocessing script(s) for the new dataset format
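
To make "sharded, stateful, and checkpointable" concrete, here is a minimal sketch in plain Python. It is not the code added in this PR: the class `StatefulShardedDataset`, its `position` field, and the round-robin sharding rule are all hypothetical, chosen only to illustrate the idea. Each rank reads its own slice of the documents, tracks how far it has read, and can save and restore that position.

```python
from typing import Dict, Iterator, List


class StatefulShardedDataset:
    """Hypothetical toy dataset: sharded by rank, resumable from a saved position."""

    def __init__(self, docs: List[List[int]], rank: int, world_size: int) -> None:
        # Assumed sharding rule: each rank takes every world_size-th document,
        # starting at its own rank index.
        self.shard = docs[rank::world_size]
        self.position = 0  # number of documents this rank has yielded so far

    def __iter__(self) -> Iterator[List[int]]:
        # Loop over the shard indefinitely, starting from the current position.
        while True:
            doc = self.shard[self.position % len(self.shard)]
            # Advance before yielding so a checkpoint taken after the consumer
            # receives `doc` already points at the next document.
            self.position += 1
            yield doc

    def state_dict(self) -> Dict[str, int]:
        # Everything needed to resume this rank's stream.
        return {"position": self.position}

    def load_state_dict(self, state: Dict[str, int]) -> None:
        self.position = state["position"]


# Eight tiny pre-tokenized "documents" (made-up token ids).
docs = [[i, i + 100] for i in range(8)]

# Rank 1 of 2 reads three documents, then checkpoints.
ds = StatefulShardedDataset(docs, rank=1, world_size=2)
it = iter(ds)
seen = [next(it) for _ in range(3)]
ckpt = ds.state_dict()

# A fresh instance restored from the checkpoint continues where the first stopped.
resumed = StatefulShardedDataset(docs, rank=1, world_size=2)
resumed.load_state_dict(ckpt)
next_after_resume = next(iter(resumed))
next_uninterrupted = next(it)
```

In this sketch the restored instance yields the same document the uninterrupted iterator would have yielded next. Shuffling, subdataset weighting, and rescaling to a different `world_size` are deliberately left out; they would need more state than a single integer.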


daviswer and others added 26 commits May 21, 2024 16:29

Add datasets, dataloader, swap out, update cfg

db20ea4

Actually add dataset file

9dbe4be

Update llama3_8b.toml

79cbe06

Update llama3_8b.toml

18f6c0d

Update llama3_8b.toml

9c94d3a

Update fms_datasets.py

465e972

Cast inputs/targs to long

2fd356a

Update llama3_8b.toml

b967c58

Update data ckp to full path

5baa18e

Merge branch 'fms-datasets' of github.com:daviswer/torchtitan into fm…

ca8503f

…s-datasets

Update llama3_8b.toml

48e2cd1

Update llama3_8b.toml

db79072

Reconcile ckp naming schemes

2597c05

Swap ckp folder names from step_ to step-

2bbed13

add llama3-tokenized c4 mini

ad201e7

Update llama3_8b.toml

5784c76

Update llama3_8b.toml

5664b8c

add llama2 tokenized data

66e7b2b

Merge branch 'main' into fms-datasets

dbf26e0

Make dataloader usage flaggable, remove refs to fms

de705af

Add back old datapath flag, separate out new flags in cfg

1f1fdc9

up

e79ec4f

up

145ab47

up

c4e6b0b

up

b0eae7d

up

d45ed13

daviswer merged commit 2e733d4 into main May 31, 2024

daviswer deleted the fms-datasets branch May 31, 2024 07:24
