Can an int4 quantized version be provided? #15
Comments
Just use load_in_4bit; that's how we ran the tests as well.
How can basic_demo/openai_api_server.py support load_in_4bit?
Ah, that won't work with vLLM, because openai_api_server already uses vLLM as its backend by default.
How can trans_web_demo.py support load_in_4bit?
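A minimal sketch of what this could look like, assuming trans_web_demo.py loads the model via AutoModelForCausalLM.from_pretrained (the config values here are illustrative, not from this thread):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Sketch: pass a 4-bit quantization config wherever the demo loads the model.
# Assumes a recent transformers release with bitsandbytes installed.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 while weights stay 4-bit
)

model = AutoModelForCausalLM.from_pretrained(
    "./glm-4-9b-chat",
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto",  # let accelerate place the quantized weights on the GPU
).eval()
```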
Will 4-bit quantization still have the same garbled-output problem as GLM-3?
How did you use the 4-bit quantization? (Do you have any example code?)
Thanks
I tested it: bitsandbytes quantization works. You can export a 4-bit version with this code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cpu"
tokenizer = AutoTokenizer.from_pretrained("./glm-4-9b-chat", trust_remote_code=True)

# A quick tokenization check; these inputs are not used by the export below.
query = "你好"
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": query}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
)
inputs = inputs.to(device)

# load_in_4bit quantizes the weights with bitsandbytes at load time
# (requires a CUDA GPU; do not call .to(device) on the quantized model).
model = AutoModelForCausalLM.from_pretrained(
    "./glm-4-9b-chat",
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    load_in_4bit=True,
).eval()

# Save the quantized weights and tokenizer as a standalone int4 checkpoint.
model.save_pretrained("glm-4-9b-chat-int4")
tokenizer.save_pretrained("glm-4-9b-chat-int4")
```
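For reference, loading the exported checkpoint back for inference might look like the sketch below (the generation settings are illustrative, not from this thread):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the int4 checkpoint exported above; device_map places it on the GPU.
tokenizer = AutoTokenizer.from_pretrained("glm-4-9b-chat-int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "glm-4-9b-chat-int4",
    trust_remote_code=True,
    device_map="auto",
).eval()

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "你好"}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128)

# Strip the prompt tokens and decode only the newly generated ones.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```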
Nice. I tested it, and simply adding "load_in_4bit=True" to the hf.py file gets it running.
@shams2023 Hi, that's right, the exported weights are int4. In my tests it takes about 8 GB of VRAM just to load the model; inference uses somewhat more (depending on context length).
@zRzRzRzRzRzRzR Can glm-4-9b be exported as a quantized model with AutoGPTQ or AutoAWQ?
In model = AutoModelForCausalLM.from_pretrained(...), remove the .to(device) call.
Which version of transformers are you using?
transformers 4.41.2 |
We've got you covered with AutoGPTQ-based 4-bit quants.
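As a rough illustration of the AutoGPTQ route mentioned above (a sketch only; the model path, calibration text, and quantization settings are assumptions, not from this thread):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_dir = "./glm-4-9b-chat"  # assumed local checkpoint path
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit GPTQ weights
    group_size=128,  # a common group size; illustrative
    desc_act=False,
)

model = AutoGPTQForCausalLM.from_pretrained(
    model_dir, quantize_config, trust_remote_code=True
)

# GPTQ needs a calibration set; a single example here just to keep the sketch short.
examples = [tokenizer("你好,请介绍一下你自己。", return_tensors="pt")]
model.quantize(examples)

model.save_quantized("./glm-4-9b-chat-gptq-int4")
```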
Does quantization work on Mac? When I run it, I get the following message:
@thomashooo AutoGPTQ inference does not support GPTQ quants on Mac. You would need to check with llama.cpp to see whether they have a GPTQ kernel written for Metal (Apple).
vLLM currently supports AWQ and GPTQ. Could quantized versions of GLM4-9B be provided in these two formats?
Why does the quantized model end up on the CPU? Is it because my CUDA version is wrong? I'm using 12.0.
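One quick check (an assumption about the usual cause, not a confirmed diagnosis from this thread): bitsandbytes 4-bit loading needs a CUDA-enabled build of PyTorch, otherwise the weights stay on the CPU.

```python
import torch

# If this prints False, PyTorch cannot see the GPU and bitsandbytes
# will not place the 4-bit weights on it.
print(torch.cuda.is_available())

# CUDA version the installed PyTorch wheel was built against.
print(torch.version.cuda)
```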
You can take a look at my PR: vllm-project/vllm#7672
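For context, loading a GPTQ or AWQ checkpoint in vLLM generally looks like the sketch below (the model path is hypothetical, and GLM4 quant support depends on the PR above):

```python
from vllm import LLM, SamplingParams

# Hypothetical path to a GPTQ-quantized GLM4 checkpoint.
llm = LLM(
    model="./glm-4-9b-chat-gptq-int4",
    quantization="gptq",
    trust_remote_code=True,
)

outputs = llm.generate(["你好"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```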
I noticed that the README under base reports VRAM usage and generation-speed measurements for both BF16 and INT4 precision, but only the BF16 model is currently provided. Will an official INT4 model be released in the future?