# Quantization of Hugging Face models using llama.cpp

0. Install `make`, `gcc`, and `git-lfs` on your system.
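   On Debian/Ubuntu, for example, the prerequisites can be installed with `apt-get` (adjust to your distribution's package manager):

   ```bash
   sudo apt-get update
   sudo apt-get install -y make gcc git-lfs
   ```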

1. Create a directory named `os_model` and move into it.

   ```bash
   mkdir os_model
   cd os_model
   ```
2. Create a virtual environment named `osenv` and activate it. The activation command depends on your OS; on Linux/macOS:

   ```bash
   python -m venv osenv
   source osenv/bin/activate
   ```

   On Windows, activate with `osenv\Scripts\activate` instead.
3. Install Git Large File Storage (LFS) and clone the sqlcoder-34b repo from Hugging Face. This takes a long time because the remote repo is nearly 140 GB; be patient and do not cancel the download.

   ```bash
   git lfs install
   git clone https://huggingface.co/defog/sqlcoder-34b-alpha
   ```
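   As a quick sanity check that the LFS weights were actually downloaded (and are not just small pointer files), you can inspect the clone:

   ```bash
   # List the LFS-tracked files in the clone
   git -C sqlcoder-34b-alpha lfs ls-files
   # The directory should be on the order of 140 GB
   du -sh sqlcoder-34b-alpha
   ```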
4. Download only the `tokenizer.model` file from https://huggingface.co/defog/sqlcoder-7b/tree/main and place it in the `sqlcoder-34b-alpha` folder; one way to do this from the command line is shown below.
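   Assuming `wget` is available, you can fetch just that file via the repo's `resolve` URL:

   ```bash
   # Download only tokenizer.model into the sqlcoder-34b-alpha folder
   wget https://huggingface.co/defog/sqlcoder-7b/resolve/main/tokenizer.model -P sqlcoder-34b-alpha/
   ```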

5. Clone llama.cpp and install its Python conversion dependencies.

   ```bash
   git clone https://github.com/ggerganov/llama.cpp
   cd llama.cpp
   pip install -r requirements.txt
   ```
6. Create a directory called `sqlcoder-34b-alpha` inside the `models` directory of the `llama.cpp` folder, as shown below.
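   Assuming you are still inside the `llama.cpp` directory from the previous step:

   ```bash
   mkdir -p models/sqlcoder-34b-alpha
   ```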

7. Build llama.cpp and convert the model to a 16-bit GGUF file.

   ```bash
   make
   # Adjust the first argument if your sqlcoder-34b-alpha clone lives elsewhere
   python3 convert.py ../os_model/sqlcoder-34b-alpha --outfile ./models/sqlcoder-34b-alpha/ggml-sqlcoder-34b-f16.gguf --outtype f16
   ```
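   If the conversion succeeds, the f16 GGUF file (at 2 bytes per parameter, roughly 68 GB for a 34B model) should appear in the output directory:

   ```bash
   ls -lh ./models/sqlcoder-34b-alpha/
   ```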
8. Now let's quantize the generated model to `q4_k`. You can instead quantize to `q8_0` for better output quality at the cost of a larger file (to do so, replace `q4_k` with `q8_0` in the command below).

   ```bash
   ./quantize ./models/sqlcoder-34b-alpha/ggml-sqlcoder-34b-f16.gguf ./models/sqlcoder-34b-alpha/ggml-sqlcoder-34b-q4_k.gguf q4_k
   ```
9. We can now use this model for various text-generation tasks; it performs especially well at generating SQL queries from natural language. A minimal inference example follows.
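   A minimal sketch of running the quantized model with the `main` binary produced by the `make` step above (the prompt and token count here are illustrative, not from the original guide):

   ```bash
   # -m: model path, -p: prompt, -n: max tokens to generate
   ./main -m ./models/sqlcoder-34b-alpha/ggml-sqlcoder-34b-q4_k.gguf \
     -p "Generate a SQL query that lists all customers who signed up in 2023." \
     -n 256
   ```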