Starcoder / Quantized Issues #1

Closed
bluecoconut opened this issue May 15, 2023 · 10 comments

@bluecoconut

Hey! Thanks for this library. I really appreciate the API and the simplicity you are bringing to this; it's exactly what I was looking for in trying to integrate ggml models into Python (specifically into my library, lambdaprompt).

One issue: something seems to be going wrong with StarCoder quantized models.
The full model works great, and I'm getting the same outputs as the ggml binary.

What works (full model weights):

 ./build/bin/starcoder -m /workspaces/research/models/starcoder/starcoder-ggml.bin -p "def fibo(" --top_k 0 --top_p 0.95 --temp 0.2

as equivalent to:

from ctransformers import AutoModelForCausalLM
llm = AutoModelForCausalLM.from_pretrained(
    '/workspaces/research/models/starcoder/starcoder-ggml.bin',
    model_type='starcoder')
print(llm("def fibo(", max_new_tokens=30, top_k=0, top_p=0.95, temperature=0.2))

These seem to give equivalent results!

What fails (quantized model weights):

However, when I change to the quantized model (to reproduce the same as this)

./build/bin/starcoder -m /workspaces/research/models/starcoder/starcoder-ggml-q4_1.bin -p "def fibo(" --top_k 0 --top_p 0.95 --temp 0.2

I get a GGML assertion failure and a core dump when loading it in Python:

Python 3.10.11 (main, Apr 12 2023, 14:46:22) [GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from ctransformers import AutoModelForCausalLM
>>> llm = AutoModelForCausalLM.from_pretrained(
...     '/workspaces/research/models/starcoder/starcoder-ggml-q4_1.bin',
...     model_type='starcoder')
GGML_ASSERT: /home/runner/work/ctransformers/ctransformers/models/ggml/src/ggml.c:4408: wtype != GGML_TYPE_COUNT
Aborted (core dumped)
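
Because a failed `GGML_ASSERT` aborts the whole interpreter rather than raising a Python exception, one way to probe a model file without losing the current session is to attempt the load in a child process. The following is a minimal sketch, not part of ctransformers: `run_isolated` and `can_load` are hypothetical helper names, and the only library call assumed is `AutoModelForCausalLM.from_pretrained(path, model_type=...)` as used above.

```python
import subprocess
import sys


def run_isolated(code: str, timeout: float = 600.0) -> int:
    """Run a Python snippet in a child interpreter and return its exit code.

    A native abort (such as a failed GGML_ASSERT) only kills the child,
    so the calling process survives and sees a non-zero exit code.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode


def can_load(model_path: str, model_type: str) -> bool:
    """Hypothetical helper: True if the model loads in a child process."""
    code = (
        "from ctransformers import AutoModelForCausalLM\n"
        f"AutoModelForCausalLM.from_pretrained({model_path!r}, "
        f"model_type={model_type!r})\n"
    )
    return run_isolated(code) == 0
```

For example, `can_load('/workspaces/research/models/starcoder/starcoder-ggml-q4_1.bin', 'starcoder')` would return `False` for a file that triggers the assertion above, at the cost of loading the model twice when it does succeed.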
@marella
Owner

marella commented May 15, 2023

Hi, I think it is due to the breaking change introduced in the quantization formats in the GGML library in ggerganov/ggml#154 yesterday.

Can you please try doing the quantization from the ggml submodule of this repo and let me know if it works:

git clone --recurse-submodules https://github.com/marella/ctransformers
cd ctransformers/models/ggml

cmake -S . -B build
cmake --build build

./build/bin/starcoder-quantize # specify path to model and quantization type

If I pull the latest changes, I think old models will stop working with this library. So I'm thinking of waiting for some time for people to convert and provide models in the new format before pulling the changes.

@bluecoconut
Author

Awesome! Thank you @marella, this is definitely the issue.

I wish there were clearer ways to version the various quantizations. I'm new to the ggml toolkit, so I didn't realize how breaking changes to the quantization would manifest.

I'll also add, the ability to pull directly from huggingface makes this super great, thank you!
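
On the versioning point, a quick way to see what a model file claims about itself is to read its header. The sketch below rests on two assumptions that should be checked against the ggml source being built: that the file starts with a `ggml` magic number followed by six 32-bit integers (`n_vocab`, `n_ctx`, `n_embd`, `n_head`, `n_layer`, `ftype`), as in the ggml StarCoder example, and that newer files fold a quantization version into `ftype` as `version * 1000 + type`. `read_header` is a hypothetical helper name.

```python
import struct

GGML_MAGIC = 0x67676D6C     # "ggml" read as a little-endian uint32
QNT_VERSION_FACTOR = 1000   # assumption: mirrors GGML_QNT_VERSION_FACTOR
HEADER_FORMAT = "<I6i"      # magic + six int32 hyperparameters (assumed layout)


def read_header(path):
    """Return the assumed header fields of a ggml StarCoder model file."""
    size = struct.calcsize(HEADER_FORMAT)
    with open(path, "rb") as f:
        raw = f.read(size)
    if len(raw) < size:
        raise ValueError("file is too short to contain a header")
    magic, n_vocab, n_ctx, n_embd, n_head, n_layer, ftype = struct.unpack(
        HEADER_FORMAT, raw
    )
    if magic != GGML_MAGIC:
        raise ValueError(f"unexpected magic number: {magic:#x}")
    return {
        "n_vocab": n_vocab,
        "n_ctx": n_ctx,
        "n_embd": n_embd,
        "n_head": n_head,
        "n_layer": n_layer,
        # Split the stored value into the type and the assumed version.
        "ftype": ftype % QNT_VERSION_FACTOR,
        "quant_version": ftype // QNT_VERSION_FACTOR,
    }
```

Under these assumptions, a file quantized before the format change would report `quant_version` 0, which is a hint that it needs to be re-quantized.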

@bgonzalezfractal

bgonzalezfractal commented May 20, 2023

Hi @bluecoconut @marella, could you provide an example? I've been trying to run the model but have had no luck; none of these commands work:

/build/bin/starcoder-quantize -m ./starcoder-ggml/starcoder-ggml-q4_1.bin 3
./build/bin/starcoder-quantize -m ./starcoder-ggml/starcoder-ggml-q4_1.bin --type=3

Before adding the type I was running:

./build/bin/starcoder-quantize -m ./starcoder-ggml/starcoder-ggml-q4_1.bin -p "def fibo(" --top_k 0 --top_p 0.95 --temp 0.2

And getting:

usage: ./build/bin/starcoder-quantize model-f32.bin model-quant.bin type
  type = "q4_0" or 2
  type = "q4_1" or 3
  type = "q4_2" or 5
  type = "q5_0" or 8
  type = "q5_1" or 9
  type = "q8_0" or 7

I don't have a specific problem using the module from transformers, but on a Mac M1 Pro with 64 GB of memory inference can take more than 10 minutes. Is this expected?
image

@marella
Owner

marella commented May 20, 2023

Hi @bgonzalezfractal, the starcoder-quantize binary is for quantizing models. For text generation you should use the starcoder binary:

./build/bin/starcoder -m ./starcoder-ggml/starcoder-ggml-q4_1.bin -p "def fibo(" --top_k 0 --top_p 0.95 --temp 0.2

The Apple M1 processor doesn't support AVX2/AVX instructions, so it will be slower. You can try setting the threads parameter to the number of physical CPU cores your system has. For example, if your system has 8 cores, try using threads=7 or threads=8:

llm(..., threads=8)

You can use this command to get the CPU core count (on Linux):

grep -m 1 'cpu cores' /proc/cpuinfo
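
Since `/proc/cpuinfo` does not exist on macOS, a portable alternative is to query the core count from Python. This is a rough sketch under stated assumptions: `physical_core_count` is a hypothetical helper, it relies on `sysctl -n hw.physicalcpu` on macOS and on the `physical id` / `core id` fields that x86 Linux kernels expose, and it falls back to the logical count from `os.cpu_count()` when neither is available.

```python
import os
import platform
import subprocess


def physical_core_count() -> int:
    """Best-effort count of physical CPU cores (hypothetical helper)."""
    system = platform.system()
    try:
        if system == "Darwin":
            out = subprocess.run(
                ["sysctl", "-n", "hw.physicalcpu"],
                capture_output=True, text=True, check=True,
            ).stdout
            return max(1, int(out.strip()))
        if system == "Linux":
            pairs = set()
            phys = core = None
            with open("/proc/cpuinfo") as f:
                for line in f:
                    key, _, value = line.partition(":")
                    key = key.strip()
                    if key == "physical id":
                        phys = value.strip()
                    elif key == "core id":
                        core = value.strip()
                    elif not line.strip():
                        # A blank line ends one logical processor entry.
                        if core is not None:
                            pairs.add((phys, core))
                        phys = core = None
            if core is not None:
                pairs.add((phys, core))
            if pairs:
                return len(pairs)
    except (OSError, ValueError, subprocess.CalledProcessError):
        pass
    # Fallback: logical cores, which may overcount on SMT machines.
    return os.cpu_count() or 1
```

The result could then be passed along as `llm(..., threads=physical_core_count())`.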

Also I just updated the build file in this repo. Can you please pull the latest changes or clone this repo and try building the library from source:

git clone --recurse-submodules https://github.com/marella/ctransformers
cd ctransformers

cmake -S . -B build
cmake --build build

The compiled binary for the library will be located at build/lib/libctransformers.dylib which can be used as:

llm = AutoModelForCausalLM.from_pretrained(..., lib='/path/to/ctransformers/build/lib/libctransformers.dylib')

llm(..., threads=8)

Can you please try this and let me know if you are seeing any improvement in performance?
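
To compare thread settings more systematically, the same prompt can be timed at several values of `threads`. The helper below is only a sketch: `time_thread_counts` is a hypothetical name, and it assumes nothing about the model beyond being callable as `llm(prompt, threads=..., ...)`, as in the calls above.

```python
import time


def time_thread_counts(generate, prompt, thread_counts, **kwargs):
    """Time generate(prompt, threads=n, **kwargs) for each n.

    Returns a dict mapping each thread count to elapsed seconds.
    """
    timings = {}
    for n in thread_counts:
        start = time.perf_counter()
        generate(prompt, threads=n, **kwargs)
        timings[n] = time.perf_counter() - start
    return timings
```

For instance, `time_thread_counts(llm, "def fibo(", [4, 7, 8], max_new_tokens=30)` would give one wall-clock measurement per setting. A single run per setting is noisy, so repeating the measurement is advisable before drawing conclusions.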

@marella
Owner

marella commented May 20, 2023

@bluecoconut Just an FYI: There is a new breaking change in quantization formats added to llama.cpp in ggerganov/llama.cpp#1508 yesterday. Initially I was planning to update to the latest version over the weekend but now I will have to wait for the new breaking changes to be added to the ggml repo.

@bgonzalezfractal

bgonzalezfractal commented May 20, 2023

@marella I was able to get the model to run from the command line, but I've had no success using the library: I get "segmentation fault python" using transformers.

For the build I used:

git clone --recurse-submodules https://github.com/marella/ctransformers
cd ctransformers/models

cmake -S . -B build
cmake --build build

I built from the models folder because that is where the CMake files are.
Using Python from the command line or Jupyter notebooks, I'm unable to run the models:

image

When building ggml from source at tag v0.1.2, text generation works fine:

image

@marella
Owner

marella commented May 21, 2023

@bgonzalezfractal Recently I updated the GGML library which has breaking changes to quantization formats. So old models have to be re-quantized. Let's continue the discussion here.


@bluecoconut This is released in the latest version, 0.2.0.

It includes the latest quantization changes and the recent fix for StarCoder, ggerganov/ggml#176. Since it includes breaking changes, old models have to be re-quantized.

It also supports LLaMA and MPT models now.

@bgonzalezfractal

@marella Can you confirm these steps would work:

  1. First, follow the starcoder.cpp instructions to quantize the model:
git clone https://github.com/bigcode-project/starcoder.cpp
cd starcoder.cpp

# Convert HF model to ggml
python convert-hf-to-ggml.py bigcode/gpt_bigcode-santacoder

# Build ggml libraries (should we build the new version of ggml, since the ggml in tag v0.1.2 of ctransformers is the one that works?)
make

# Quantize the model (either starcoder, llama or mpt?)
./quantize models/bigcode/gpt_bigcode-santacoder-ggml.bin models/bigcode/gpt_bigcode-santacoder-ggml-q4_1.bin 3
  2. Once quantized, we can use it directly as follows:
./build/bin/starcoder -m $MODEL_PATH/starcoder-ggml-q4_1.bin -p "Write a pandas function that takes in a DataFrame with a 'price' column and calculates the average prices per month and year as new columns.Use matplotlib to plot the results and return the dataframe. \ndef calculate_df_avg_price(" --top_k 0 --top_p 0.95 --temp 0.2
  3. And it should work in ctransformers like this:
import os
from ctransformers import AutoModelForCausalLM

# Read MODEL_PATH from the environment, matching the shell command above
model_path = os.environ['MODEL_PATH']
llm = AutoModelForCausalLM.from_pretrained(
    f'{model_path}/starcoder-ggml-q4_1.bin',
    model_type='starcoder')
print(llm("def fibo(", max_new_tokens=200, top_k=0, top_p=0.95, temperature=0.2))

@marella
Owner

marella commented May 21, 2023

I don't think the starcoder.cpp repo has been updated to the latest GGML version, so it might not work.

The steps in the GGML repo should work with the latest version of ctransformers: https://github.com/ggerganov/ggml/tree/master/examples/starcoder#quick-start

@bgonzalezfractal

@marella, I quantized StarCoder again and it runs at 183.36 ms per token:
image
Nonetheless, with the latest version of ctransformers and Python 3.10, the model loads but seems to get stuck on inference and does not produce any results:
image
It keeps using memory and CPU:
image

Any luck?

3 participants