add contributing guidelines #525
Conversation
CONTRIBUTING.md
- 2D FSDP + TP – if 1D FSDP does not suffice to make comparisons due to limited scalability. For example, this should be the baseline when experimenting with 3D parallelisms on the Llama 3.1 405B model.

#### Performance
- Memory and WPS / MFU, which are available from logging, should meet expectations. If necessary, verify the numbers on jobs spanning multiple nodes (e.g. on 64 GPUs).
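To make the 2D FSDP + TP baseline above concrete, here is a minimal sketch using PyTorch-native APIs. It is illustrative only and is not torchtitan's actual parallelization code; it assumes torch >= 2.4 with the composable FSDP2 `fully_shard` API, a toy two-layer model, and that the process group is already initialized (e.g. launched via torchrun).

```python
# Minimal 2D FSDP + TP sketch on 8 GPUs (2-way data parallel x 4-way tensor
# parallel). Illustrative only; torchtitan composes these in its own code.
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)
from torch.distributed._composable.fsdp import fully_shard  # FSDP2

dp_size, tp_size = 2, 4
mesh = init_device_mesh("cuda", (dp_size, tp_size), mesh_dim_names=("dp", "tp"))

# Toy FFN block standing in for a transformer layer.
model = nn.Sequential(nn.Linear(4096, 11008), nn.Linear(11008, 4096)).cuda()

# Apply tensor parallelism over the "tp" mesh dimension first...
parallelize_module(
    model,
    mesh["tp"],
    {"0": ColwiseParallel(), "1": RowwiseParallel()},
)
# ...then shard parameters with FSDP over the "dp" mesh dimension on top.
fully_shard(model, mesh=mesh["dp"])
```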
What are the expectations for performance? Accuracy is easy to justify (the same, or the loss curve is comparable). What's the WPS / MFU expectation? For example, if a technique can reduce memory with no better or worse WPS / MFU, the feature should still be acceptable because it allows training larger models.
My suggestion is to clarify the expectation for WPS / MFU. If a feature is intended to address a memory issue, it may not be best to compare against the best MFU.
> What's the WPS / MFU expectation? For example, if a technique can reduce memory with no better or worse WPS / MFU, the feature should still be acceptable because it allows training larger models.
Agree that for different techniques, the expectation should be different. This doesn't mean we can accept an arbitrary WPS / MFU regression as long as there is any memory reduction. E.g., what if a technique is "supposed" to show 80% memory reduction with 10% MFU regression, but the PR only shows 40% memory reduction with 5% MFU regression? Should we accept the PR? Clearly not.
IMO the bar might need to be set case by case, and as long as the PR owner can justify it (whether by comparing with other similar implementations, achieving theoretically optimal performance, etc.), we should accept. It's like how a research paper should have a solid experiments section, and reviewers can make the judgment.
> My suggestion is to clarify the expectation for WPS / MFU. If a feature is intended to address a memory issue, it may not be best to compare against the best MFU.
Sounds good, let me add something.
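For context on what these numbers mean: WPS is the logged tokens-per-second throughput, and MFU compares achieved FLOPs against hardware peak. A minimal sketch of the usual estimate, assuming a dense decoder-only model and the common ~6 * num_params FLOPs-per-token approximation (torchtitan's logged value may be computed differently):

```python
# Rough MFU estimate from logged throughput (illustrative; the exact FLOPs
# accounting in torchtitan's logs may differ, e.g. it may include attention).
def estimate_mfu(num_params: float, wps: float, num_gpus: int,
                 peak_flops_per_gpu: float) -> float:
    # ~6 FLOPs per parameter per token (forward + backward) for a dense model.
    achieved_flops_per_sec = 6 * num_params * wps
    return achieved_flops_per_sec / (num_gpus * peak_flops_per_gpu)

# Example: an 8B-parameter model, 24k tokens/s total on 8 GPUs,
# each with ~989 TFLOPS dense BF16 peak (H100).
print(f"MFU: {estimate_mfu(8e9, 24_000, 8, 989e12):.1%}")
```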
* [torchtitan/checkpoint.py](https://github.com/pytorch/torchtitan/blob/main/torchtitan/checkpoint.py) - utils for saving/loading distributed checkpoints
* [torchtitan/float8.py](https://github.com/pytorch/torchtitan/blob/main/torchtitan/float8.py) - utils for applying Float8 techniques
* [torchtitan/models/llama/model.py](https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama/model.py) - the Llama model definition (shared for Llama2 and Llama3 variants)
* [train.py](train.py) - the main training loop and high-level setup code
oh, nice, didn't know you could link this way :)
### Principles of contribution
- Apply PyTorch-native training techniques. |
I wonder if we should justify more why we have this principle, or if it's enough to just state it...
because the repo is sitting under pytorch lol?
I think this is beyond the scope of this PR. E.g., in the README's first sentence, we set the tone: "torchtitan is a proof-of-concept for Large-scale LLM training using native PyTorch." Maybe we can rethink that.
overall LGTM!
As titled. Hope these guidelines could help clarify what & how to contribute to torchtitan, and make the repo more self-service.
This PR also updates some other docs in terms of formatting and removing outdated info.