torchtrain infrastructure building

Closed Oct 18, 2024 100% complete

Core infrastructure we need to build:

  • enable a single TOML config for the different parts of training (e.g. checkpointing, parallelisms, profiling)
  • enable checkpoint save/load
  • collect metrics (e.g. wps, memory usage, loss value)
  • add TensorBoard for visualizing metrics such as loss
  • add more tests
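
To make the first and third items concrete, here is a minimal sketch of what a single TOML config and a wps metric could look like. It is an illustration only: the section names, keys, and helper functions (`load_config`, `words_per_second`) are hypothetical and are not taken from torchtrain's actual schema or code, and wps is assumed here to mean words (tokens) processed per second.

```python
try:
    import tomllib  # standard library in Python 3.11+
except ImportError:  # older interpreters: assumes the third-party tomli package
    import tomli as tomllib

# Hypothetical config: one file with a section per training component.
# All section and key names below are illustrative assumptions.
EXAMPLE_TOML = """
[checkpoint]
enable = true
folder = "./checkpoints"
interval = 500

[parallelism]
data_parallel_degree = 4
tensor_parallel_degree = 2

[profiling]
enable = false

[metrics]
log_freq = 10
enable_tensorboard = true
"""


def load_config(text: str) -> dict:
    """Parse TOML text into a nested dict keyed by section name."""
    return tomllib.loads(text)


def words_per_second(num_words: int, elapsed_seconds: float) -> float:
    """Throughput over one logging window (assumed meaning of 'wps')."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_words / elapsed_seconds


config = load_config(EXAMPLE_TOML)
checkpoint_interval = config["checkpoint"]["interval"]
tp_degree = config["parallelism"]["tensor_parallel_degree"]
wps = words_per_second(num_words=8192, elapsed_seconds=2.0)
```

Keeping every component under its own section lets each part of the training loop read only the table it needs, while a run stays fully described by one file.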

This milestone is closed.

No open issues remain.