A simple chat application powered by Llama 3, using OpenVINO Runtime for inference and the Hugging Face transformers library for tokenization.
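The server code itself is not shown in this README, but the core inference flow looks roughly like the sketch below: optimum-intel's OVModelForCausalLM runs the exported OpenVINO IR on OpenVINO Runtime, while the transformers tokenizer formats the chat prompt and decodes the reply. The model directory and prompt are illustrative, and the sketch assumes the model has already been downloaded or exported as described in the steps that follow.

    # Minimal sketch of the inference flow (illustrative, not the app's actual code).
    from optimum.intel.openvino import OVModelForCausalLM
    from transformers import AutoTokenizer

    model_dir = "models/llama-3.1-instruct-8b"  # exported IR directory (see the steps below)

    # OVModelForCausalLM executes the IR on OpenVINO Runtime; the tokenizer is loaded
    # from the same directory, where the export steps also save it.
    model = OVModelForCausalLM.from_pretrained(model_dir)
    tokenizer = AutoTokenizer.from_pretrained(model_dir)

    # Build a Llama 3 chat prompt and generate a reply.
    messages = [{"role": "user", "content": "What is OpenVINO?"}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )
    output = model.generate(input_ids, max_new_tokens=128)
    print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))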
- Download the INT-4 quantized Meta-Llama-3.1-8B-Instruct model, already converted to the OpenVINO IR format, from Hugging Face using huggingface-cli with the following command:

    huggingface-cli download rajatkrishna/Meta-Llama-3.1-8b-Instruct-OpenVINO-INT4 --local-dir models/llama-3.1-instruct-8b
- Build the Docker image with the following command. The source files and model weights are pulled using git, requiring an active internet connection.

    docker build -t chat-llama .
- Mount the model directory and start the container using:

    docker run -v $(pwd)/models:/chat-app/models -p 5000:5000 chat-llama

  This should start the Flask dev server, available at http://localhost:5000 (a quick reachability check follows).
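If the container started correctly, the root URL should respond. A minimal check from the host, assuming the chat UI is served at the root path:

    curl -s -o /dev/null -w "%{http_code}\n" http://localhost:5000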
To run the app locally without Docker, you will need Python 3.11:
- Clone the repository:

    git clone https://github.com/rajatkrishna/llama3-openvino
- Create a new virtual environment to avoid dependency conflicts:

    python3 -m venv .env
    source .env/bin/activate
- Install the dependencies in requirements.txt:

    pip install -r requirements.txt
- Start the Flask server from the project root (a note on host and port options follows):

    python3 -m flask run
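By default, flask run serves on 127.0.0.1:5000. If the server needs to listen on another interface or port (for example, to reach it from another machine), the standard Flask CLI flags can be passed; the values below are only an example:

    python3 -m flask run --host 0.0.0.0 --port 5000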
- To export the meta-llama/Meta-Llama-3.1-8B-Instruct model quantized to INT-4 yourself using the optimum-intel CLI, first install the requirements in requirements_export.txt:

    pip install -r requirements_export.txt

  Then run the following from the project root:

    optimum-cli export openvino --model meta-llama/Meta-Llama-3.1-8B-Instruct --weight-format int4 models/llama-3.1-instruct-8b
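optimum-cli also accepts other weight formats. For example, an INT-8 export would only change the weight-format flag (the output directory name below is arbitrary):

    optimum-cli export openvino --model meta-llama/Meta-Llama-3.1-8B-Instruct --weight-format int8 models/llama-3.1-instruct-8b-int8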
- Alternatively, use the following steps to export the INT-4 quantized model using the Python API:
- Import the dependencies:

    >>> from optimum.intel.openvino import OVWeightQuantizationConfig, OVModelForCausalLM
    >>> from transformers import AutoTokenizer
- Load the model using the OVModelForCausalLM class. Set export=True to export the model on the fly:

    >>> model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
    >>> export_path = "models/llama-3.1-instruct-8b"
    >>> q_config = OVWeightQuantizationConfig(bits=4, sym=True, group_size=128)
    >>> model = OVModelForCausalLM.from_pretrained(model_name, export=True, quantization_config=q_config)
    >>> model.save_pretrained(export_path)
- Now use AutoTokenizer to save the tokenizer:

    >>> tokenizer = AutoTokenizer.from_pretrained(model_name)
    >>> tokenizer.save_pretrained(export_path)
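Either export path leaves a self-contained model directory. As an optional sanity check, the exported IR and tokenizer can be reloaded without export=True (see the inference sketch near the top of this document for a full generation example):

    >>> # Reloads the saved IR directly; no conversion happens this time.
    >>> model = OVModelForCausalLM.from_pretrained(export_path)
    >>> tokenizer = AutoTokenizer.from_pretrained(export_path)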