torchtrain infrastructure building
Closed Oct 18, 2024
100% complete
Core infrastructure we need to build:
- enable a single toml config for the different parts of training (e.g. checkpoint, parallelisms, profiling)
- enable checkpoint save/load
- collect metrics (e.g. wps, memory usage, loss value)
- add TensorBoard for visualizing metrics such as losses
- testing: add more tests
This milestone is closed.