Add support for Baichuan2 models #1022
Conversation
Without a NormHead module, we can simply norm the lm_head weight when loading the tensor. Baichuan and Baichuan2's alibi mask is different from vLLM's implementation, which causes divergence of outputs.
Baichuan2's alibi mask code looks different, but is the result the same (i.e., the same problem)?
Yes. We should only norm it once when loading the tensor in load_weights(); just modify the original Baichuan model to do it when the model version is 2.
No, the result under fp16 is the same. However, the alibi masks under bf16 are different, especially when the context is long (for example 4096).
The implementation of the alibi mask in vLLM, GPT-NeoX, and MPT just follows the original paper, which is better than the implementation in Baichuan and Bloom.
Tried to hand-calculate Baichuan2's alibi mask vs. Baichuan 1's in vLLM. Overall it doesn't look drastically different. Baichuan2's alibi slopes are only slightly different from those in the vLLM Baichuan 1 implementation's _get_alibi_slopes() (due to floating-point precision?).
Mask generation looks fine
The mask generation process of Baichuan 2's alibi mask is consistent with Baichuan 1's. Why they modified the implementation is beyond me, though.
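For reference, here is a minimal sketch of the standard ALiBi slope computation, following the reference code in the ALiBi paper; the vLLM _get_alibi_slopes() helper discussed above should produce essentially the same values, though its exact code may differ.

import math

def get_alibi_slopes(total_num_heads: int) -> list:
    # Slopes for a power-of-two head count form a geometric sequence
    # starting at 2**(-8 / n), with the same value as the ratio.
    def power_of_2_slopes(n: int) -> list:
        start = 2 ** (-(2 ** -(math.log2(n) - 3)))
        return [start ** (i + 1) for i in range(n)]

    if math.log2(total_num_heads).is_integer():
        return power_of_2_slopes(total_num_heads)
    # For non-power-of-two head counts (e.g. 40 heads in the 13B models),
    # interleave slopes taken from the next power of two, as in the paper's code.
    closest = 2 ** math.floor(math.log2(total_num_heads))
    return (power_of_2_slopes(closest) +
            power_of_2_slopes(2 * closest)[0::2][: total_num_heads - closest])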
Thanks for the suggestion, I believe it is also mentioned in the migration guide: https://github.com/baichuan-inc/Baichuan2/blob/main/README_EN.md#migrating-inference-optimizations-from-baichuan-1-to-baichuan-2 I have pushed a new commit adding norm head; please check the updated PR comment.
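To make the load-time idea above concrete, here is a minimal sketch, assuming a load_weights()-style loop over (name, tensor) pairs; the actual PR code may differ.

import torch

def maybe_normalize_lm_head(name: str, loaded_weight: torch.Tensor) -> torch.Tensor:
    # Baichuan2's NormHead L2-normalizes the lm_head rows at inference time;
    # normalizing once when the checkpoint is loaded lets the runtime model
    # keep a plain linear lm_head.
    if name == "lm_head.weight":
        return torch.nn.functional.normalize(loaded_weight)
    return loaded_weight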
vLLM generates the mask like [-3000, -3016....-2.5,-1.1, 0]; the precision of values around -1 to -2 is high, and those belong to the closer tokens. But Baichuan's mask looks like [0,1,......3000,3016,3016]; the precision around 3000 is very low in bf16 format (the interval is 16), and the small numbers with higher precision are far away from the current token.
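A quick PyTorch sketch (not from the PR) illustrating this precision point: bf16 keeps only 8 significant bits, so values around 3000 are representable only in steps of 16, while values near zero (the closest tokens) keep fine resolution.

import torch

large = torch.tensor([3000.0, 3001.0, 3015.0], dtype=torch.bfloat16)
small = torch.tensor([-2.5, -1.1, 0.0], dtype=torch.bfloat16)
# All three large values collapse to the same representable number (3008.),
# while the small values near zero stay close to their fp32 counterparts.
print(large)
print(small)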
vllm/model_executor/model_loader.py
Outdated
@@ -15,6 +15,8 @@
"AquilaModel": AquilaForCausalLM,
"BaiChuanForCausalLM": BaiChuanForCausalLM,  # baichuan-7b
"BaichuanForCausalLM": BaichuanForCausalLM,  # baichuan-13b
"BaiChuan2ForCausalLM": BaiChuan2ForCausalLM,  # baichuan-7b
Cool work!! However I still have a small question about the PR.
https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat/blob/d022d7264467b2c3bc483e7a3a17105dedba50b8/modeling_baichuan.py#L536
https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat/blob/d022d7264467b2c3bc483e7a3a17105dedba50b8/config.json#L8
According to the official Baichuan2 code, they still call their model BaichuanForCausalLM.
Does that mean that if we directly use a Baichuan2 model downloaded from the HF repo, vLLM will never load the Baichuan2 code?
Thanks for catching this. I pushed a workaround by comparing vocab_size to decide which Baichuan version to call.
in _get_model_architecture()
# Baichuan 2 has a different vocab size
if "baichuan" in arch.lower() and getattr(config, "vocab_size") == 125696:
    return Baichuan2ForCausalLM
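(For context, an assumption based on the public model configs rather than this thread: Baichuan 2 checkpoints report vocab_size 125696 in config.json, while Baichuan 1 checkpoints use 64000, which is why the vocabulary size can serve as a version discriminator here.)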
…ize in get_model_architecture
When norming the fp16 weight, it raises:
Interesting, maybe we load weights differently? Anyway, those statements are only meant for printing the norms before and after the norm-head operation for a quick visual inspection, so one can simply comment out the two print statements. I've already done so in my latest commit; you can re-fetch this PR and try again.
I noticed that the function ends with
Oh yes, I loaded fine-tuned weights of my own. I trained with DeepSpeed + LoRA and finally merged the adapter into the original model, so maybe that changed the data type. The official weights don't raise that error.
Hi @WoosukKwon, could you please review this PR by @garyfanhku (#1022)? The code seems to be without issues. Please review when possible. Thanks!
I've updated the code to support both Baichuan2-7B and 13B, thanks to the revision proposed in #1092. Cheers.
Generated text: 'Hello, my name is [your name]. Nice to meet you!'
Prompt: None,
Generated text: 'The current president of the United States is Joe Biden, who was sworn into office on January 20, 2021.'
>>>>>> Baichuan2-13B-Chat 8Bit Demo:
What is the meaning of 8Bit here? I see that the model you loaded above is Baichuan2-13B-Chat.
They're just for comparing vllm's output with HF model's output. A consistency check.
okay
if getattr(config, "intermediate_size") == 11008:
    return BaiChuan2ForCausalLM
elif getattr(config, "intermediate_size") == 13696:
    return Baichuan2ForCausalLM
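(For context, an assumption from the public configs rather than this thread: 11008 is the MLP intermediate_size in the Baichuan2-7B config and 13696 in the Baichuan2-13B config, so the check above distinguishes the 7B and 13B variants.)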
if getattr(config, "intermediate_size") == 11008:
    arch = "BaiChuan2ForCausalLM"
elif getattr(config, "intermediate_size") == 13696:
    arch = "Baichuan2ForCausalLM"
Is this better? Since you have already added it in _MODEL_REGISTRY.
Seems equivalent to me. Any particular benefits?
They're equal, but if you return directly, there is no need to add it to _MODEL_REGISTRY.
lgtm
Hi Gary, perhaps you should consider modifying the specific lines of the original code instead of rewriting it entirely. I believe this would make it easier to merge your code.
@JaheimLee @garyfanhku I also encountered this problem. What is the solution? Can you provide me with some ideas? Thank you so much!
@jugglq Please make sure your local repo is up to date. Or check if the print statements mentioned above were commented out. |
I also loaded fine-tuned LoRA weights like @JaheimLee. I finally found a solution and it worked:

loaded_weight = convert_pyslice_to_tensor(loaded_weight)
loaded_weight = loaded_weight.to('cuda')
@exceedzhang @yinjuxin Hi, have you checked this solution (#1403)? I faced a similar situation when trying to load Baichuan 1. Good luck!
It's an issue related to Baichuan/InternLM. I checked that vLLM bumped the transformers requirement to 4.34 to accommodate Mistral. If that doesn't matter to you, I would suggest downgrading vLLM. Check this for more info: baichuan-inc/Baichuan2#226
@garyfanhku Hi Gary, I noticed that they have made some changes (e.g., they deleted the tensor_parallel dir) since this commit: ba0bfd4, so you'd better adjust your code for the new version of vLLM so they can merge it, I guess?
It appears modifying baichuan2.py following the changes in
Sure, I would be glad to help~
I tried rebasing the code to the latest version by modifying baichuan2.py, and it currently works for Baichuan2 models.
The Baichuan2 repo provides a script that can convert Baichuan2 weights into Baichuan 1 format.
Closing the PR as the model is supported by vLLM.
@WoosukKwon Baichuan2 is still not supported; the supported model is Baichuan.
Added the Baichuan2 model and config and registered Baichuan2 as a new model. Added an offline inference example for validating generation outputs with models that use the chat format.
Notes:
- NormHead() was yet to be implemented; added NormHead in load_weights() per the suggestion from @nexa123, although it does not affect the output during my testing.
- (Potential bug) Text generation output seemed to be prepended with a whitespace; the prepended whitespace is caused by prompts not being formatted in chat format. Details below.
- LLM.generate() does not seem to handle chat-formatted prompts like [{"role":"user", "content":"..."}], which are adopted by the Baichuan 1 & 2 models. Therefore I added an example, offline_inference_baichuan.py, fixing the prompt format by copying build_chat_input() from the Baichuan2 repo and passing prompt_token_ids directly (a rough sketch of this flow follows these notes).
- I also tested output consistency against non-vLLM pipelines. vLLM-generated outputs look largely consistent with Baichuan2's local inference results. Below is an example comparison.
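As a rough illustration of the chat-format handling described in the notes above, here is a minimal sketch. The model path, the simplified build_chat_input() stand-in, and the special role token ids (taken to be 195/196 per Baichuan2's generation config) are assumptions; the real example file copies the original build_chat_input() from the Baichuan2 repo instead.

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "baichuan-inc/Baichuan2-13B-Chat"  # example model path

def build_chat_input(tokenizer, messages):
    # Simplified stand-in for the Baichuan2 repo's build_chat_input():
    # wrap each turn with the special user/assistant tokens and end with the
    # assistant token so the model starts answering. Token ids are assumptions.
    user_token_id, assistant_token_id = 195, 196
    ids = []
    for message in messages:
        role_id = user_token_id if message["role"] == "user" else assistant_token_id
        ids.append(role_id)
        ids.extend(tokenizer.encode(message["content"]))
    ids.append(assistant_token_id)
    return ids

tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
llm = LLM(model=MODEL, trust_remote_code=True)
params = SamplingParams(temperature=0.3, max_tokens=256)

prompt_ids = build_chat_input(tokenizer, [{"role": "user", "content": "Hello!"}])
outputs = llm.generate(prompt_token_ids=[prompt_ids], sampling_params=params)
print(outputs[0].outputs[0].text)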