Starcoder / Quantized Issues #1

bluecoconut · 2023-05-15T01:43:22Z

Hey! Thanks for this library. I really appreciate the API and the simplicity you are bringing to this; it's exactly what I was looking for in trying to integrate ggml models into Python (specifically into my library, lambdaprompt).

One issue, it seems like there's something going wrong with starcoder quantized models.
For the full model, it seems to work great, and I'm getting the same outputs it seems.

What works (full model weights):

 ./build/bin/starcoder -m /workspaces/research/models/starcoder/starcoder-ggml.bin -p "def fibo(" --top_k 0 --top_p 0.95 --temp 0.2

as equivalent to:

from ctransformers import AutoModelForCausalLM
llm = AutoModelForCausalLM.from_pretrained(
    '/workspaces/research/models/starcoder/starcoder-ggml.bin',
    model_type='starcoder')
print(llm("def fibo(", max_new_tokens=30, top_k=0, top_p=0.95, temperature=0.2))

Seem to give equivalent results!

What fails (quantized model weights):

However, when I change to the quantized model (to reproduce the same as this)

./build/bin/starcoder -m /workspaces/research/models/starcoder/starcoder-ggml-q4_1.bin -p "def fibo(" --top_k 0 --top_p 0.95 --temp 0.2

I get a core dumped ggml error

Python 3.10.11 (main, Apr 12 2023, 14:46:22) [GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from ctransformers import AutoModelForCausalLM
>>> llm = AutoModelForCausalLM.from_pretrained(
...     '/workspaces/research/models/starcoder/starcoder-ggml-q4_1.bin',
...     model_type='starcoder')
GGML_ASSERT: /home/runner/work/ctransformers/ctransformers/models/ggml/src/ggml.c:4408: wtype != GGML_TYPE_COUNT
Aborted (core dumped)


marella · 2023-05-15T14:48:43Z

Hi, I think it is due to the breaking change introduced in the quantization formats in the GGML library in ggerganov/ggml#154 yesterday.

Can you please try doing the quantization from the ggml submodule of this repo and let me know if it works:

git clone --recurse-submodules https://github.com/marella/ctransformers
cd ctransformers/models/ggml

cmake -S . -B build
cmake --build build

./build/bin/starcoder-quantize # specify path to model and quantization type

If I pull the latest changes, I think old models will stop working with this library. So I'm thinking of waiting for some time for people to convert and provide models in the new format before pulling the changes.
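To check which quantization format a model file was written with, one option is to read the file header directly. The sketch below is not part of ctransformers or ggml. It assumes the header layout used by the GGML StarCoder example (a 4-byte magic followed by six 32-bit integers: n_vocab, n_ctx, n_embd, n_head, n_layer, ftype) and assumes that, after the format change, the quantization version is stored in ftype as a multiple of 1000. The function name read_ftype is hypothetical, and the layout should be verified against the GGML source before relying on it.

```python
import struct

# Assumptions (not confirmed in this thread): the file starts with the
# "ggml" magic, followed by six int32 hyperparameters ending in ftype,
# and ftype = base_type + quantization_version * 1000.
GGML_MAGIC = 0x67676D6C
QNT_VERSION_FACTOR = 1000
HEADER_FORMAT = "<I6i"  # little-endian: uint32 magic + 6 x int32


def read_ftype(path):
    """Return (base_ftype, quantization_version) read from a model header."""
    size = struct.calcsize(HEADER_FORMAT)
    with open(path, "rb") as f:
        header = f.read(size)
    if len(header) < size:
        raise ValueError("file is too short to contain a model header")
    magic, _n_vocab, _n_ctx, _n_embd, _n_head, _n_layer, ftype = struct.unpack(
        HEADER_FORMAT, header
    )
    if magic != GGML_MAGIC:
        raise ValueError("unexpected magic: not a plain ggml model file")
    quantization_version, base_ftype = divmod(ftype, QNT_VERSION_FACTOR)
    return base_ftype, quantization_version
```

Under these assumptions, a quantization version of 0 would indicate a file written before the format change, which would need to be re-quantized.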

bluecoconut · 2023-05-16T06:14:50Z

Awesome! Thank you @marella this is definitely the issue.

I wish there were more clear ways to version the various quantizations -- I'm new to the ggml toolkit and so I didn't realize how breaking changes to the quantization would manifest.

I'll also add, the ability to pull directly from huggingface makes this super great, thank you!

bgonzalezfractal · 2023-05-20T02:24:48Z

Hi @bluecoconut @marella, could you provide an example? I've been trying to execute the model but have had no luck. None of these commands work:

/build/bin/starcoder-quantize -m ./starcoder-ggml/starcoder-ggml-q4_1.bin 3
./build/bin/starcoder-quantize -m ./starcoder-ggml/starcoder-ggml-q4_1.bin --type=3

before adding type I was doing:

./build/bin/starcoder-quantize -m ./starcoder-ggml/starcoder-ggml-q4_1.bin -p "def fibo(" --top_k 0 --top_p 0.95 --temp 0.2

And getting:

usage: ./build/bin/starcoder-quantize model-f32.bin model-quant.bin type
  type = "q4_0" or 2
  type = "q4_1" or 3
  type = "q4_2" or 5
  type = "q5_0" or 8
  type = "q5_1" or 9
  type = "q8_0" or 7

I don't have a specific problem using the module from transformers, but on a Mac M1 Pro with 64 GB of memory, inference can take more than 10 minutes. Is this expected?

marella · 2023-05-20T11:59:07Z

Hi @bgonzalezfractal, starcoder-quantize binary is for quantizing models. For text generation you should use starcoder binary:

./build/bin/starcoder -m ./starcoder-ggml/starcoder-ggml-q4_1.bin -p "def fibo(" --top_k 0 --top_p 0.95 --temp 0.2
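For the quantization step itself, the usage message quoted earlier in this thread gives the argument order for starcoder-quantize: input model, output model, then type. A minimal sketch that assembles that command is below. The helper name quantize_command and the example paths are hypothetical; the type table is copied from the usage output above.

```python
# Type names and numeric ids as listed in the starcoder-quantize usage text.
QUANT_TYPES = {"q4_0": 2, "q4_1": 3, "q4_2": 5, "q5_0": 8, "q5_1": 9, "q8_0": 7}


def quantize_command(binary, src, dst, qtype):
    """Build the argument list: <binary> model-f32.bin model-quant.bin type."""
    if qtype not in QUANT_TYPES:
        raise ValueError(f"unknown quantization type: {qtype!r}")
    return [binary, src, dst, qtype]


# Hypothetical paths, for illustration only.
cmd = quantize_command(
    "./build/bin/starcoder-quantize",
    "./starcoder-ggml/starcoder-ggml.bin",
    "./starcoder-ggml/starcoder-ggml-q4_1.bin",
    "q4_1",
)
print(" ".join(cmd))
```

The resulting list could be passed to subprocess.run(cmd, check=True) to run the quantization. The quantized output file is then what gets passed to the starcoder binary with -m.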

Apple M1 processor doesn't support AVX2/AVX instructions so it will be slower. You can try increasing the threads parameter to the number of physical cpu cores your system has. For example, if your system has 8 cores, try using threads=7 or threads=8:

llm(..., threads=8)

On Linux, you can use this command to get the CPU core count (on macOS, sysctl -n hw.physicalcpu reports the number of physical cores):

grep -m 1 'cpu cores' /proc/cpuinfo
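The thread count can also be chosen from Python. The sketch below is a rough, platform-independent alternative and not part of ctransformers: pick_threads is a hypothetical helper, and os.cpu_count() reports logical cores, which can be higher than the physical core count on CPUs with hyperthreading.

```python
import os


def pick_threads(reserve=0, logical_cores=None):
    """Return a thread count, leaving `reserve` cores free (never below 1)."""
    if logical_cores is None:
        # os.cpu_count() may return None; fall back to a single thread.
        logical_cores = os.cpu_count() or 1
    return max(1, logical_cores - reserve)


# The value would be passed as the threads argument shown above, e.g.
# llm(..., threads=pick_threads(reserve=1))
print(pick_threads(reserve=1))
```

Reserving one core mirrors the suggestion above to try threads=7 on an 8-core system.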

Also I just updated the build file in this repo. Can you please pull the latest changes or clone this repo and try building the library from source:

git clone --recurse-submodules https://github.com/marella/ctransformers
cd ctransformers

cmake -S . -B build
cmake --build build

The compiled binary for the library will be located at build/lib/libctransformers.dylib which can be used as:

llm = AutoModelForCausalLM.from_pretrained(..., lib='/path/to/ctransformers/build/lib/libctransformers.dylib')

llm(..., threads=8)

Can you please try this and let me know if you are seeing any improvement in performance.

marella · 2023-05-20T12:06:54Z

@bluecoconut Just an FYI: There is a new breaking change in quantization formats added to llama.cpp in ggerganov/llama.cpp#1508 yesterday. Initially I was planning to update to the latest version over the weekend but now I will have to wait for the new breaking changes to be added to the ggml repo.

bgonzalezfractal · 2023-05-20T23:24:23Z

@marella I was able to get the model to run from the command line, but I have had no success using the library: I get "segmentation fault python" using transformers.

For the build I used:

git clone --recurse-submodules https://github.com/marella/ctransformers
cd ctransformers/models

cmake -S . -B build
cmake --build build

I built from there because the CMake files were placed in the models folder.
Whether I use Python from the command line or Jupyter notebooks, I'm unable to run the models.

When building ggml from source at tag v0.1.2, text generation works fine:

marella · 2023-05-21T02:47:27Z

@bgonzalezfractal Recently I updated the GGML library which has breaking changes to quantization formats. So old models have to be re-quantized. Let's continue the discussion here.

@bluecoconut This is released in the latest version 0.2.0

It includes the latest quantization changes and the recent fix for StarCoder, ggerganov/ggml#176. Since it includes breaking changes, old models have to be re-quantized.

It also supports LLaMA, MPT models now.

bgonzalezfractal · 2023-05-21T03:05:07Z

@marella Can you confirm these steps would work:

First, follow the starcoder.cpp instructions to quantize the model:

git clone https://github.com/bigcode-project/starcoder.cpp
cd starcoder.cpp

# Convert HF model to ggml
python convert-hf-to-ggml.py bigcode/gpt_bigcode-santacoder

# Build ggml libraries (Should we build the new version of ggml, since ggml in tag v0.1.2 in ctransformers is the one working)
make

# quantize the model (either starcoder, llama or mpt?)
./quantize models/bigcode/gpt_bigcode-santacoder-ggml.bin models/bigcode/gpt_bigcode-santacoder-ggml-q4_1.bin 3

Once quantized we can use it directly as follows:

./build/bin/starcoder -m $MODEL_PATH/starcoder-ggml-q4_1.bin -p "Write a pandas function that takes in a DataFrame with a 'price' column and calculates the average prices per month and year as new columns.Use matplotlib to plot the results and return the dataframe. \ndef calculate_df_avg_price(" --top_k 0 --top_p 0.95 --temp 0.2

And it should work in ctransformers like this:

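import os

from ctransformers import AutoModelForCausalLM

# Read MODEL_PATH from the environment, matching $MODEL_PATH in the command above
model_path = os.path.join(os.environ['MODEL_PATH'], 'starcoder-ggml-q4_1.bin')
llm = AutoModelForCausalLM.from_pretrained(model_path, model_type='starcoder')
print(llm("def fibo(", max_new_tokens=200, top_k=0, top_p=0.95, temperature=0.2))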

marella · 2023-05-21T03:24:27Z

I don't think starcoder.cpp repo has updated to the latest GGML version so it might not work.

The steps on GGML repo should work with the latest version of ctransformers: https://github.com/ggerganov/ggml/tree/master/examples/starcoder#quick-start

bgonzalezfractal · 2023-05-21T18:27:57Z

@marella, quantized starcoder again, works at 183.36 ms per token:

Nonetheless, with the latest version of ctransformers and python3.10, the llm loads but seems to be stuck on inference and does not produce any results:

Keeps using memory and CPU:

Any luck?

bluecoconut closed this as completed May 16, 2023

bluecoconut mentioned this issue May 16, 2023

Local Mode fails on GGML models approximatelabs/sketch#22

Open

marella mentioned this issue May 17, 2023

New ggml llamacpp file format support #4

Closed

marella mentioned this issue May 21, 2023

Performance on Apple silicon #5

Closed

marella mentioned this issue May 22, 2023

Segmentation fault on m1 mac #8

Closed
