Reproducible scaling for stop/resume #58

Open
philipp-fischer opened this issue Feb 7, 2025 · 0 comments
@philipp-fischer
Collaborator

The docs currently state that training can be continued with a different node count, but in practice this only works when the training is re-run, not when it is resumed from a saved state.

We should add a function that redistributes the saved state to a new parallel config. The user would call it before restoring the state.
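
To make the proposal concrete, here is a minimal sketch of what such a redistribution step could look like. Everything in it is an assumption for illustration: `RankState`, `redistribute_state`, and the idea that each rank's saved state is a list of remaining `[start, end)` sample index ranges are hypothetical and do not reflect the project's actual state format or API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RankState:
    """Hypothetical per-rank saved state: unconsumed [start, end) sample ranges."""
    remaining: List[Tuple[int, int]] = field(default_factory=list)


def redistribute_state(old_states: List[RankState], new_world_size: int) -> List[RankState]:
    """Split the unconsumed samples of all old ranks evenly across new ranks.

    The result depends only on the saved ranges and the new world size,
    so calling it twice with the same inputs yields the same assignment.
    """
    if new_world_size < 1:
        raise ValueError("new_world_size must be at least 1")

    # Collect all non-empty ranges in a deterministic (sorted) order.
    ranges = sorted(r for s in old_states for r in s.remaining if r[1] > r[0])
    total = sum(end - start for start, end in ranges)

    # Each new rank gets total // n samples; the first total % n ranks get one more.
    base, extra = divmod(total, new_world_size)
    quotas = [base + (1 if i < extra else 0) for i in range(new_world_size)]

    new_states = [RankState() for _ in range(new_world_size)]
    rank = 0
    for start, end in ranges:
        while start < end:
            # Move on to the next rank that still has quota left.
            while quotas[rank] == 0:
                rank += 1
            take = min(end - start, quotas[rank])
            new_states[rank].remaining.append((start, start + take))
            quotas[rank] -= take
            start += take
    return new_states


# Usage: a state saved from 2 ranks is redistributed to 4 ranks before restoring.
old = [RankState([(4, 10)]), RankState([(12, 20)])]
new = redistribute_state(old, new_world_size=4)
```

In this sketch the user would pass the redistributed per-rank states to the normal restore path. A real implementation would also have to handle whatever else is stored in the state (for example shuffle or RNG state), which is not modeled here.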
