Could an int4 quantized version be provided? #15

Closed
mjzcng opened this issue Jun 5, 2024 · 23 comments
@mjzcng

mjzcng commented Jun 5, 2024

I noticed that the README in base reports GPU memory usage and generation speed for both BF16 and INT4 precision, but only the BF16 model is currently provided. Will an official INT4 version be released in the future?

@zRzRzRzRzRzRzR
Member

I noticed that the README in base reports GPU memory usage and generation speed for both BF16 and INT4 precision, but only the BF16 model is currently provided. Will an official INT4 version be released in the future?

Just load in 4bit; that is how we ran the tests.
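
For anyone looking for a concrete starting point, a minimal sketch of what "load in 4bit" can look like with transformers and bitsandbytes; the BitsAndBytesConfig options below are illustrative assumptions, not necessarily the exact settings used for the official benchmark:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Quantize the weights to 4-bit while loading; requires a CUDA GPU and the bitsandbytes package.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b-chat",
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",
).eval()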

@triumph

triumph commented Jun 5, 2024

How can basic_demo/openai_api_server.py support load in 4bit?

zRzRzRzRzRzRzR self-assigned this Jun 5, 2024
@zRzRzRzRzRzRzR
Member

Oh, with vLLM it won't work, because openai_api_server already uses vLLM as its backend by default.
As for vLLM, it indeed cannot load this kind of 4bit model.
See vllm-project/vllm#4033 for the specific issue.

@triumph

triumph commented Jun 5, 2024

How can trans_web_demo.py support load in 4bit?

@M1saka10010

Will 4bit quantization still have the same garbled-output problem as glm3?

@shams2023

Will 4bit quantization still have the same garbled-output problem as glm3?

How did you use 4bit quantization? (Do you have a code screenshot?)

@M1saka10010

Will 4bit quantization still have the same garbled-output problem as glm3?

How did you use 4bit quantization? (Do you have a code screenshot?)

#15 (comment)

@shams2023

Will 4bit quantization still have the same garbled-output problem as glm3?

How did you use 4bit quantization? (Do you have a code screenshot?)

#15 (comment)

Thanks

@galena01

galena01 commented Jun 5, 2024

I tested it: bitsandbytes quantization works. You can export a 4-bit version with this code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cpu"

tokenizer = AutoTokenizer.from_pretrained("./glm-4-9b-chat", trust_remote_code=True)

query = "你好"

# Build chat-formatted inputs (only needed if you want to test generation afterwards).
inputs = tokenizer.apply_chat_template([{"role": "user", "content": query}],
                                       add_generation_prompt=True,
                                       tokenize=True,
                                       return_tensors="pt",
                                       return_dict=True
                                       )

inputs = inputs.to(device)

# load_in_4bit quantizes the weights with bitsandbytes while loading; this needs a CUDA GPU.
model = AutoModelForCausalLM.from_pretrained(
    "./glm-4-9b-chat",
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    load_in_4bit=True
).eval()

# Save the quantized weights and tokenizer so they can be reloaded directly.
model.save_pretrained("glm-4-9b-chat-int4")
tokenizer.save_pretrained("glm-4-9b-chat-int4")

@shams2023

shams2023 commented Jun 5, 2024

I tested it: bitsandbytes quantization works. You can export a 4-bit version with this code: [code quoted above]

Are the exported weights really int4? (Below are the model weight files I exported with your code.)
(screenshot: exported model weight files)
After getting the weights I ran the following code, but it still errors out (code below):
(screenshot: loading code and error message)
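
For reference, a hedged sketch of reloading the exported folder, assuming transformers 4.41+ with bitsandbytes and accelerate installed: the quantization settings are stored in the saved config, so load_in_4bit does not need to be passed again, and .to(device) should be skipped because bitsandbytes places the weights itself:

from transformers import AutoModelForCausalLM, AutoTokenizer

# The saved config already records the 4-bit quantization, so a plain from_pretrained is enough.
tokenizer = AutoTokenizer.from_pretrained("glm-4-9b-chat-int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "glm-4-9b-chat-int4",
    trust_remote_code=True,
    device_map="auto",  # let accelerate place the 4-bit weights on the GPU
).eval()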

@swordfar

swordfar commented Jun 5, 2024

I tested it: bitsandbytes quantization works. You can export a 4-bit version with this code: [code quoted above]

Nice, I tested it: simply adding "load_in_4bit=True" to the hf.py file gets it running.
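
Assuming hf.py builds the model with AutoModelForCausalLM.from_pretrained, the edit is presumably along these lines (the surrounding variable names are hypothetical):

# Hypothetical excerpt of hf.py: add load_in_4bit=True to the existing from_pretrained call.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,              # whatever path variable the script already uses
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    load_in_4bit=True,       # quantize with bitsandbytes at load time (needs a CUDA GPU)
).eval()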

@galena01

galena01 commented Jun 5, 2024

@shams2023 Hi, yes, the exported weights are int4. In my tests it takes about 8GB of VRAM just to load the model, and inference uses somewhat more (depending on the context length).
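
If useful, a rough sketch of running generation on the 4-bit model (tokenizer and model as loaded above); VRAM grows with the KV cache, i.e. with prompt length and max_new_tokens:

import torch

query = "你好"
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": query}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)  # move inputs to the device the 4-bit weights live on

with torch.no_grad():
    # Longer prompts and larger max_new_tokens enlarge the KV cache, hence more VRAM.
    outputs = model.generate(**inputs, max_new_tokens=512)

print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))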

@maxin9966

@zRzRzRzRzRzRzR Can glm-4-9b be exported to a corresponding quantized model with autogptq or autoawq?

@wh336699

wh336699 commented Jun 6, 2024

@shams2023

model = AutoModelForCausalLM.from_pretrained(
    "/home/wanhao/project/ChatGLM-9B/GLM-4/GLM-4-INT8/glm-4-9b-chat-GPTQ-Int8",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    # load_in_4bit=True
).eval()

Remove the .to(device) call.

@zs001122

zs001122 commented Jun 6, 2024

I tested it: bitsandbytes quantization works. You can export a 4-bit version with this code: [code quoted above]

Nice, I tested it: simply adding "load_in_4bit=True" to the hf.py file gets it running.

Which version of transformers are you using?

@swordfar

swordfar commented Jun 6, 2024

I tested it: bitsandbytes quantization works. You can export a 4-bit version with this code: [code quoted above]

Nice, I tested it: simply adding "load_in_4bit=True" to the hf.py file gets it running.

Which version of transformers are you using?

transformers 4.41.2

@Qubitium

Qubitium commented Jun 6, 2024

We've got you covered with AutoGPTQ-based 4-bit quants.

AutoGPTQ/AutoGPTQ#683
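
For reference, the usual AutoGPTQ quantization flow looks roughly like the sketch below; this is a generic example rather than the exact script behind the linked quants, GLM-4 support may require the version from that PR, and the calibration texts are placeholders:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_dir = "./glm-4-9b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(model_dir, quantize_config, trust_remote_code=True)

# A real run needs a few hundred representative calibration samples; these are placeholders.
calib_texts = ["你好", "请介绍一下大语言模型的量化方法。"]
examples = [tokenizer(text, return_tensors="pt") for text in calib_texts]

model.quantize(examples)  # GPTQ calibration + weight packing
model.save_quantized("glm-4-9b-chat-gptq-int4")
tokenizer.save_pretrained("glm-4-9b-chat-gptq-int4")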

@thomashooo

Does Mac support quantization? Running it gives:
(chatglm) thomas@bogon basic_demo % python3 trans_to_4bit.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
File "/Users/thomas/Documents/Projects/AI/GLM-4/basic_demo/trans_to_bit.py", line 27, in
model = AutoModelForCausalLM.from_pretrained(
File "/Users/thomas/miniconda3/envs/chatglm/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
return model_class.from_pretrained(
File "/Users/thomas/miniconda3/envs/chatglm/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3030, in from_pretrained
raise RuntimeError("No GPU found. A GPU is needed for quantization.")
RuntimeError: No GPU found. A GPU is needed for quantization.

@Qubitium

Qubitium commented Jun 7, 2024

@thomashooo AutoGPTQ inference does not support GPTQ quants on Mac. You'll need to check with llama.cpp to see whether they have a GPTQ kernel written for Metal (Apple).

@shudct

shudct commented Aug 5, 2024

Oh, with vLLM it won't work, because openai_api_server already uses vLLM as its backend by default. As for vLLM, it indeed cannot load this kind of 4bit model. See vllm-project/vllm#4033 for the specific issue.

vLLM currently supports AWQ and GPTQ. Could GLM4-9B be provided in these two quantized versions?

@Qubitium

Qubitium commented Aug 5, 2024

vLLM currently supports AWQ and GPTQ. Could GLM4-9B be provided in these two quantized versions?

https://huggingface.co/ModelCloud
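
A hedged sketch of loading one of those GPTQ quants with vLLM's Python API; the model id below is a placeholder, substitute an actual GLM-4-9B GPTQ repo:

from vllm import LLM, SamplingParams

# Placeholder model id; point this at a real GLM-4-9B GPTQ checkpoint.
llm = LLM(model="ModelCloud/glm-4-9b-chat-gptq-4bit",
          quantization="gptq",
          trust_remote_code=True)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["你好"], params)
print(outputs[0].outputs[0].text)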

@xiny0008

I noticed that the README in base reports GPU memory usage and generation speed for both BF16 and INT4 precision, but only the BF16 model is currently provided. Will an official INT4 version be released in the future?

Just load in 4bit; that is how we ran the tests.

Why does the quantized model end up on the CPU? Is it because my CUDA version is wrong? I'm using 12.0.

@alexw994

You can take a look at my PR: vllm-project/vllm#7672
