feat/llama-2 examples #319
Conversation
@winglian fixed. Also added a lora example.
Do these config files work out of the box without the changes listed in #294? Or will we need to wait for the suggested changes in that issue to be implemented?
I have successfully trained a Llama-2 7B QLoRA on a 3090 using this and it seems to work. Thanks for this!
Have also trained Llama-2 7B and 13B with this, both showing good improvements 👍
According to the Llama-2 page, they recommend adding the PAD token to the special token config. I think this is easy to add; we can use the one that axolotl hardcodes. However, I'm unclear whether point 2 within #294 is necessary.
@NanoCode012 Added the pad token. Not sure if it made a difference for training, but can confirm inference still works.
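(For readers following along, the pad token discussed above lives in the `special_tokens` block of the example YAML configs. The sketch below is an assumption about that block's shape based on this thread, not the exact committed file; the token strings, in particular the `<pad>` value, may differ.)

```yaml
# Sketch of the special_tokens block discussed above. Token strings are
# assumptions; check the committed examples/llama-2 configs for real values.
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
  pad_token: "<pad>"
```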
7B works fine on an ml.g5.12xlarge SageMaker instance.
I've successfully trained Llama-2 13B with the suggested QLoRA configuration and it worked well. I'm having some trouble with the 70B model though: it looks like our xformers attention monkeypatch doesn't like the grouped-query attention that the 70B model uses, and I also got errors trying to run it with FlashAttention instead of xformers attention.

One other comment: the way we've added the pad token at the moment sets both the pad token and the unk token to token 0 in the tokenizer. I doubt this causes any real issues in practice, but I noticed it in the debug output.

All in all though, the new configs work for the smaller models, which is great! (For reference, here's the stack trace from the 70B model training attempt; I just tried this again with the latest Docker container on RunPod and got a slightly different error to the old one.)
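(For reference, the two attention backends being compared above are selected via config flags. The sketch below assumes axolotl's standard flag names and is illustrative only; as noted, neither backend is a confirmed fix for the 70B grouped-query attention error.)

```yaml
# Illustrative only: how a run would switch between the attention backends
# mentioned above. Flag names are assumed from axolotl's config options.
xformers_attention: true    # the monkeypatched xformers path that fails on 70B GQA
flash_attention: false
# ...or the FlashAttention path, which also errored on the 70B model:
# xformers_attention: false
# flash_attention: true
```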
Further to my comment about the 70B model, this looks very similar to what people are experiencing on FastChat here: lm-sys/FastChat#2075. It looks like someone over there got it working by updating one of their dependencies; I'm asking for more info at the moment. (Would people prefer that I move this into its own issue rather than mucking up the Pull Request?)
Yes, it would be helpful to have a new issue for the 70B-specific problems.
* qlora llama-2
* qlora llama-2
* linting
* readme
* lora added
* linting
* change group_by_length
* 13b fitting on 24gb
* grouped lengths true
* add pad token
* change out dir

Co-authored-by: Mads Henrichsen <mads@Brbar-tilhrende-Mads.local>
Example of QLoRA training for Llama-2 7B.
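Below is an illustrative sketch of what such a QLoRA config might look like, assembled from the commits listed above (pad token, group_by_length, out dir). Field names follow axolotl's config schema, but the specific values and the dataset path are assumptions, not the committed example file.

```yaml
# Illustrative QLoRA config sketch for Llama-2 7B in the spirit of this PR's
# examples. Values and the dataset path are assumptions, not the committed file.
base_model: meta-llama/Llama-2-7b-hf
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer

load_in_4bit: true            # QLoRA: 4-bit base weights
adapter: qlora
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true

sequence_len: 4096
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
optimizer: paged_adamw_32bit
lr_scheduler: cosine
learning_rate: 0.0002
bf16: true
group_by_length: true         # matches the "grouped lengths true" commit above

datasets:
  - path: mhenrichsen/alpaca_2k_test   # placeholder dataset; assumption
    type: alpaca

output_dir: ./qlora-out

special_tokens:
  pad_token: "<pad>"
```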