Could an int4 quantized version be provided? #15

Closed
mjzcng opened this issue Jun 5, 2024 · 23 comments
@mjzcng

mjzcng commented Jun 5, 2024

I noticed that the README in base reports GPU memory usage and generation speed for both BF16 and INT4 precision, but only the BF16 model is currently provided. Will an official INT4 version be released in the future?

@zRzRzRzRzRzRzR
Member

I noticed that the README in base reports GPU memory usage and generation speed for both BF16 and INT4 precision, but only the BF16 model is currently provided. Will an official INT4 version be released in the future?

Just load in 4bit; that is how we ran the tests.
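
For anyone looking for a concrete starting point, a minimal sketch of what "load in 4bit" can look like with transformers and bitsandbytes; the BitsAndBytesConfig options below are illustrative assumptions, not necessarily the exact settings used for the official benchmark:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Quantize the weights to 4-bit while loading; requires a CUDA GPU and the bitsandbytes package.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b-chat",
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",
).eval()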

@triumph

triumph commented Jun 5, 2024

How can basic_demo/openai_api_server.py support load in 4bit?

zRzRzRzRzRzRzR self-assigned this Jun 5, 2024
@zRzRzRzRzRzRzR
Member

Oh, with vLLM it won't work, because openai_api_server already uses vLLM as its backend by default.
As for vLLM, it indeed cannot load this kind of 4bit model.
See vllm-project/vllm#4033 for the specific issue.

@triumph

triumph commented Jun 5, 2024

How can trans_web_demo.py support load in 4bit?

@M1saka10010

Will 4bit quantization still have the same garbled-output problem as glm3?

@shams2023

Will 4bit quantization still have the same garbled-output problem as glm3?

How did you use 4bit quantization? (Do you have a code screenshot?)

@M1saka10010

Will 4bit quantization still have the same garbled-output problem as glm3?

How did you use 4bit quantization? (Do you have a code screenshot?)

#15 (comment)

@shams2023

Will 4bit quantization still have the same garbled-output problem as glm3?

How did you use 4bit quantization? (Do you have a code screenshot?)

#15 (comment)

Thanks

@galena01

galena01 commented Jun 5, 2024

I tested it: bitsandbytes quantization works. You can export a 4-bit version with this code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cpu"

tokenizer = AutoTokenizer.from_pretrained("./glm-4-9b-chat", trust_remote_code=True)

query = "你好"

# Build chat-formatted inputs (only needed if you want to test generation afterwards).
inputs = tokenizer.apply_chat_template([{"role": "user", "content": query}],
                                       add_generation_prompt=True,
                                       tokenize=True,
                                       return_tensors="pt",
                                       return_dict=True
                                       )

inputs = inputs.to(device)

# load_in_4bit quantizes the weights with bitsandbytes while loading; this needs a CUDA GPU.
model = AutoModelForCausalLM.from_pretrained(
    "./glm-4-9b-chat",
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    load_in_4bit=True
).eval()

# Save the quantized weights and tokenizer so they can be reloaded directly.
model.save_pretrained("glm-4-9b-chat-int4")
tokenizer.save_pretrained("glm-4-9b-chat-int4")

@shams2023

shams2023 commented Jun 5, 2024

I tested it: bitsandbytes quantization works. You can export a 4-bit version with this code: [code quoted above]

Are the exported weights really int4? (Below are the model weight files I exported with your code.)
(screenshot: exported model weight files)
After getting the weights I ran the following code, but it still errors out (code below):
(screenshot: loading code and error message)
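
For reference, a hedged sketch of reloading the exported folder, assuming transformers 4.41+ with bitsandbytes and accelerate installed: the quantization settings are stored in the saved config, so load_in_4bit does not need to be passed again, and .to(device) should be skipped because bitsandbytes places the weights itself:

from transformers import AutoModelForCausalLM, AutoTokenizer

# The saved config already records the 4-bit quantization, so a plain from_pretrained is enough.
tokenizer = AutoTokenizer.from_pretrained("glm-4-9b-chat-int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "glm-4-9b-chat-int4",
    trust_remote_code=True,
    device_map="auto",  # let accelerate place the 4-bit weights on the GPU
).eval()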

@swordfar

swordfar commented Jun 5, 2024

I tested it: bitsandbytes quantization works. You can export a 4-bit version with this code: [code quoted above]

Nice, I tested it: simply adding "load_in_4bit=True" to the hf.py file gets it running.
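
Assuming hf.py builds the model with AutoModelForCausalLM.from_pretrained, the edit is presumably along these lines (the surrounding variable names are hypothetical):

# Hypothetical excerpt of hf.py: add load_in_4bit=True to the existing from_pretrained call.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,              # whatever path variable the script already uses
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    load_in_4bit=True,       # quantize with bitsandbytes at load time (needs a CUDA GPU)
).eval()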

@galena01

galena01 commented Jun 5, 2024

@shams2023 Hi, yes, the exported weights are int4. In my tests it takes about 8GB of VRAM just to load the model, and inference uses somewhat more (depending on the context length).
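
If useful, a rough sketch of running generation on the 4-bit model (tokenizer and model as loaded above); VRAM grows with the KV cache, i.e. with prompt length and max_new_tokens:

import torch

query = "你好"
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": query}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)  # move inputs to the device the 4-bit weights live on

with torch.no_grad():
    # Longer prompts and larger max_new_tokens enlarge the KV cache, hence more VRAM.
    outputs = model.generate(**inputs, max_new_tokens=512)

print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))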

@maxin9966

@zRzRzRzRzRzRzR Can glm-4-9b be exported to a corresponding quantized model with autogptq or autoawq?

@wh336699

wh336699 commented Jun 6, 2024

@shams2023

model = AutoModelForCausalLM.from_pretrained(
    "/home/wanhao/project/ChatGLM-9B/GLM-4/GLM-4-INT8/glm-4-9b-chat-GPTQ-Int8",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    # load_in_4bit=True
).eval()

Remove the .to(device) call.

@zs001122

zs001122 commented Jun 6, 2024

I tested it: bitsandbytes quantization works. You can export a 4-bit version with this code: [code quoted above]

Nice, I tested it: simply adding "load_in_4bit=True" to the hf.py file gets it running.

Which version of transformers are you using?

@swordfar

swordfar commented Jun 6, 2024

I tested it: bitsandbytes quantization works. You can export a 4-bit version with this code: [code quoted above]

Nice, I tested it: simply adding "load_in_4bit=True" to the hf.py file gets it running.

Which version of transformers are you using?

transformers 4.41.2

@Qubitium

Qubitium commented Jun 6, 2024

We've got you covered with AutoGPTQ-based 4-bit quants.

AutoGPTQ/AutoGPTQ#683
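
For reference, the usual AutoGPTQ quantization flow looks roughly like the sketch below; this is a generic example rather than the exact script behind the linked quants, GLM-4 support may require the version from that PR, and the calibration texts are placeholders:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_dir = "./glm-4-9b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(model_dir, quantize_config, trust_remote_code=True)

# A real run needs a few hundred representative calibration samples; these are placeholders.
calib_texts = ["你好", "请介绍一下大语言模型的量化方法。"]
examples = [tokenizer(text, return_tensors="pt") for text in calib_texts]

model.quantize(examples)  # GPTQ calibration + weight packing
model.save_quantized("glm-4-9b-chat-gptq-int4")
tokenizer.save_pretrained("glm-4-9b-chat-gptq-int4")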

@thomashooo

Does Mac support quantization? Running it gives:
(chatglm) thomas@bogon basic_demo % python3 trans_to_4bit.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
File "/Users/thomas/Documents/Projects/AI/GLM-4/basic_demo/trans_to_bit.py", line 27, in
model = AutoModelForCausalLM.from_pretrained(
File "/Users/thomas/miniconda3/envs/chatglm/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
return model_class.from_pretrained(
File "/Users/thomas/miniconda3/envs/chatglm/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3030, in from_pretrained
raise RuntimeError("No GPU found. A GPU is needed for quantization.")
RuntimeError: No GPU found. A GPU is needed for quantization.

@Qubitium

Qubitium commented Jun 7, 2024

@thomashooo AutoGPTQ inference does not support GPTQ quants on Mac. You'll need to check with llama.cpp to see whether they have a GPTQ kernel written for Metal (Apple).

@shudct

shudct commented Aug 5, 2024

Oh, with vLLM it won't work, because openai_api_server already uses vLLM as its backend by default. As for vLLM, it indeed cannot load this kind of 4bit model. See vllm-project/vllm#4033 for the specific issue.

vLLM currently supports AWQ and GPTQ. Could GLM4-9B be provided in these two quantized versions?

@Qubitium

Qubitium commented Aug 5, 2024

vLLM currently supports AWQ and GPTQ. Could GLM4-9B be provided in these two quantized versions?

https://huggingface.co/ModelCloud
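
A hedged sketch of loading one of those GPTQ quants with vLLM's Python API; the model id below is a placeholder, substitute an actual GLM-4-9B GPTQ repo:

from vllm import LLM, SamplingParams

# Placeholder model id; point this at a real GLM-4-9B GPTQ checkpoint.
llm = LLM(model="ModelCloud/glm-4-9b-chat-gptq-4bit",
          quantization="gptq",
          trust_remote_code=True)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["你好"], params)
print(outputs[0].outputs[0].text)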

@xiny0008

I noticed that the README in base reports GPU memory usage and generation speed for both BF16 and INT4 precision, but only the BF16 model is currently provided. Will an official INT4 version be released in the future?

Just load in 4bit; that is how we ran the tests.

Why does the quantized model end up on the CPU? Is it because my CUDA version is wrong? I'm using 12.0.

@alexw994

You can take a look at my PR: vllm-project/vllm#7672
