Add CodeLlama usage to the README and make sure it works (pytorch#330)
orionr authored and malfet committed Jul 17, 2024
1 parent 9209436 commit 81d09b7
Showing 3 changed files with 53 additions and 27 deletions.
72 changes: 46 additions & 26 deletions README.md
@@ -28,7 +28,7 @@ git clone https://github.com/pytorch/torchchat.git
cd torchchat
pip install -r requirements.txt
# ensure everything installed correctly. If this command works you'll see a welcome message and some details
# ensure everything installed correctly
python torchchat.py --help
```
@@ -57,13 +57,13 @@ python torchchat.py download llama3

### Chat
Designed for interactive and conversational use.
In chat mode, the LLM engages in a back-and-forth dialogue with the user. It responds to queries, participates in discussions, provides explanations, and can adapt to the flow of conversation. This mode is typically what you see in applications aimed at simulating conversational partners or providing customer support.
In chat mode, the LLM engages in a back-and-forth dialogue with the user. It responds to queries, participates in discussions, provides explanations, and can adapt to the flow of conversation.

For more information run `python torchchat.py chat --help`

**Examples**
```
# Chat with some parameters
python torchchat.py chat llama3 --tiktoken
```

### Generate
@@ -74,18 +74,24 @@ For more information run `python torchchat.py generate --help`

**Examples**
```
python torchchat.py generate llama3 --device=cpu --dtype=fp16 --tiktoken
python torchchat.py generate llama3 --dtype=fp16 --tiktoken
```

### Export
Compiles a model for different use cases
Compiles a model and saves it to run later.

For more information run `python torchchat.py export --help`

**Examples**

AOT Inductor:
```
python torchchat.py export stories15M --output-pte-path=stories15m.pte
python torchchat.py export stories15M --output-dso-path stories15M.so
```

ExecuTorch:
```
python torchchat.py export stories15M --output-pte-path stories15M.pte
```

### Browser
@@ -94,20 +100,20 @@ Run a chatbot in your browser that’s supported by the model you specify in the
**Examples**

```
python torchchat.py browser stories15M --device cpu --temperature 0 --num-samples 10
python torchchat.py browser stories15M --temperature 0 --num-samples 10
```

The terminal should print *Running on http://127.0.0.1:5000*. Click the link or go to [http://127.0.0.1:5000](http://127.0.0.1:5000) in your browser to start interacting with it.

Enter some text in the input box, then hit the enter key or click the “SEND” button. After a second or two, the text you entered will be displayed together with the generated text. Repeat to have a conversation.

### Eval
Uses lm_eval library to evaluate model accuracy on a variety of tasks. Defaults to wikitext and can be manually controlled using the tasks and limit args.l
Uses the lm_eval library to evaluate model accuracy on a variety of tasks. Defaults to wikitext and can be manually controlled using the tasks and limit args.

For more information run `python torchchat.py eval --help`

**Examples**

Eager mode:
```
python torchchat.py eval stories15M -d fp32 --limit 5
@@ -118,6 +124,7 @@ To test the perplexity for lowered or quantized model, pass it in the same way y
```
python torchchat.py eval stories15M --pte-path stories15M.pte --limit 5
```

## Models
These are the supported models
| Model | Mobile Friendly | Notes |
@@ -139,59 +146,72 @@ These are the supported models
See the [documentation on GGUF](docs/GGUF.md) to learn how to use GGUF files.
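
Models are fetched with the download command shown earlier. As an example, and assuming `download` accepts the same aliases as `chat`, the `codellama` alias added to `config/data/models.json` by this change can be used directly:

```
# Download the CodeLlama 7B Python weights via the alias from config/data/models.json
python torchchat.py download codellama
```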

**Examples**

```
#Llama3
# Llama 3 8B Instruct
python torchchat.py chat llama3 --tiktoken
```

```
#Stories
# Stories 15M
python torchchat.py chat stories15M
```

```
#CodeLama
# CodeLlama 7B for Python
python torchchat.py chat codellama
```

## Desktop Execution

### AOTI (AOT Inductor ) - PC Specific
AOT compiles models into machine code before execution, enhancing performance and predictability. It's particularly beneficial for frequently used models or those requiring quick start times. AOTI also increases security by not exposing the model at runtime. However, it may lead to larger binary sizes and lacks the runtime optimization flexibility
### AOTI (AOT Inductor)
AOT compiles models into machine code before execution, enhancing performance and predictability. It's particularly beneficial for frequently used models or those requiring quick start times. However, it may lead to larger binary sizes and lacks the runtime flexibility of eager mode.

**Examples**
The following example uses the Stories15M model.
```
# Compile
python torchchat.py export stories15M --device cpu --output-dso-path stories15M.so
python torchchat.py export stories15M --output-dso-path stories15M.so
# Execute
python torchchat.py generate --device cpu --dso-path stories15M.so --prompt "Hello my name is"
python torchchat.py generate --dso-path stories15M.so --prompt "Hello my name is"
```

NOTE: The exported model will be large. We suggest you quantize the model (explained further down) before deploying it on device.

### ExecuTorch
ExecuTorch enables you to optimize your model for execution on a mobile or embedded device
ExecuTorch enables you to optimize your model for execution on a mobile or embedded device, but can also be used on desktop for testing.

**Examples**
The following example uses the Stories15M model.
```
# Compile
python torchchat.py export stories15M --output-pte-path stories15M.pte
# Execute
python torchchat.py generate --device cpu --pte-path stories15M.pte --prompt "Hello my name is"
```
If you want to deploy and execute a model within your iOS app <do this>
If you want to deploy and execute a model within your Android app <do this>
If you want to deploy and execute a model within your edge device <do this>
If you want to experiment with our sample apps, check out our iOS and Android sample apps.

See below under Mobile Execution if you want to deploy and execute a model in your iOS or Android app.

## Quantization
Quantization focuses on reducing the precision of model parameters and computations from floating-point to lower-bit integers, such as 8-bit integers. This approach aims to minimize memory requirements, accelerate inference speeds, and decrease power consumption, making models more feasible for deployment on edge devices with limited computational resources. While quantization can potentially degrade the model's performance, the methods supported by torchchat are designed to mitigate this effect, maintaining a balance between efficiency and accuracy.
Quantization focuses on reducing the precision of model parameters and computations from floating-point to lower-bit integers, such as 8-bit and 4-bit integers. This approach aims to minimize memory requirements, accelerate inference speeds, and decrease power consumption, making models more feasible for deployment on edge devices with limited computational resources. While quantization can potentially degrade the model's performance, the methods supported by torchchat are designed to mitigate this effect, maintaining a balance between efficiency and accuracy.

TODO:
- Brief rundown on supported quant modes and torchchat.py flags (emphasis on brief).
- Recommendations for quantization modes for 7b local chat, 7b on mobile, etc.
- One line that shows the performance difference between the base model and the 4bit
- Link to Quantization.md.
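
As a rough sketch of what this looks like in practice (the `--quantize` flag and its JSON config schema are assumptions here, not confirmed by this page; see the quantization documentation below for the supported modes):

```
# Hypothetical example: 4-bit weight-only quantization applied at export time.
# The --quantize flag and config schema are assumptions; consult docs/quantization.md.
python torchchat.py export stories15M \
  --quantize '{"linear:int4": {"groupsize": 256}}' \
  --output-dso-path stories15M.so
```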

Read the [Quantization documention](docs/quantization.md) for more details.
Read the [quantization documentation](docs/quantization.md) for more details.

## Mobile Execution
**Prerequisites**

Install [ExecuTorch](https://pytorch.org/executorch/stable/getting-started-setup.html)
ExecuTorch lets you run your model on a mobile or embedded device. The exported ExecuTorch .pte model file plus runtime is all you need.

Install [ExecuTorch](https://pytorch.org/executorch/stable/getting-started-setup.html) to get started.
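
A minimal sketch of the end-to-end flow (the export command is the same as in the desktop ExecuTorch example above; the app-side steps are covered in the platform docs below):

```
# Export the model to an ExecuTorch .pte file (same command as the desktop flow)
python torchchat.py export stories15M --output-pte-path stories15M.pte
# Bundle stories15M.pte together with the ExecuTorch runtime into your iOS or
# Android app; see the iOS and Android docs linked below for those steps.
```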

[iOS Details](docs/iOS.md)
Read the [iOS documentation](docs/iOS.md) for more details on iOS.

[Android Details](docs/Android.md)
Read the [Android documentation](docs/Android.md) for more details on Android.
5 changes: 5 additions & 0 deletions config/data/models.json
@@ -9,6 +9,11 @@
"distribution_channel": "HuggingFaceSnapshot",
"distribution_path": "meta-llama/Llama-2-7b-chat-hf"
},
"meta-llama/CodeLlama-7b-Python-hf": {
"aliases": ["codellama", "codellama-7b"],
"distribution_channel": "HuggingFaceSnapshot",
"distribution_path": "meta-llama/CodeLlama-7b-Python-hf"
},
"mistralai/Mistral-7B-Instruct-v0.2": {
"aliases": ["mistral-7b", "mistral-7b-instruct"],
"distribution_channel": "HuggingFaceSnapshot",
3 changes: 2 additions & 1 deletion download.py
Expand Up @@ -82,8 +82,9 @@ def download_and_convert(
def is_model_downloaded(model: str, models_dir: Path) -> bool:
model_config = resolve_model_config(model)

# Check if the model directory exists and is not empty.
model_dir = models_dir / model_config.name
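# Note: the isdir() check must come first; os.listdir() raises FileNotFoundError
# on a missing directory, and `and` short-circuits to avoid that.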
return os.path.isdir(model_dir)
return os.path.isdir(model_dir) and len(os.listdir(model_dir)) > 0


def main(args):