This repository has been archived by the owner on Oct 25, 2024. It is now read-only.

[NeuralChat] Fix PC codegen streaming issue and set 'Intel/neural-chat-7b-v3-1' as default model #920

Merged: 5 commits merged on Dec 14, 2023
4 changes: 2 additions & 2 deletions intel_extension_for_transformers/neural_chat/README.md
@@ -108,7 +108,7 @@ NeuralChat supports fine-tuning the pretrained large language model (LLM) for te

```shell
# Command line
neuralchat finetune --base_model "meta-llama/Llama-2-7b-chat-hf" --config pipeline/finetuning/config/finetuning.yaml
neuralchat finetune --base_model "Intel/neural-chat-7b-v3-1" --config pipeline/finetuning/config/finetuning.yaml
```

```python
@@ -124,7 +124,7 @@ NeuralChat provides typical model optimization technologies, like `Automatic Mix

```shell
# Command line
neuralchat optimize --base_model "meta-llama/Llama-2-7b-chat-hf" --config pipeline/optimization/config/optimization.yaml
neuralchat optimize --base_model "Intel/neural-chat-7b-v3-1" --config pipeline/optimization/config/optimization.yaml
```

```python
2 changes: 1 addition & 1 deletion intel_extension_for_transformers/neural_chat/config.py
@@ -412,7 +412,7 @@ class LoadingModelConfig:

class PipelineConfig:
def __init__(self,
model_name_or_path="meta-llama/Llama-2-7b-chat-hf",
model_name_or_path="Intel/neural-chat-7b-v3-1",
tokenizer_name_or_path=None,
hf_access_token=None,
device="auto",
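With this change, a `PipelineConfig` built without arguments defaults to the Intel model instead of the gated Llama-2 checkpoint. A minimal usage sketch (assuming the package's top-level `build_chatbot` entry point and the chatbot's `predict` method, which are not part of the lines shown here):

```python
# Sketch only: PipelineConfig's new default is taken from the diff above;
# build_chatbot and predict are assumed from the NeuralChat getting-started docs.
from intel_extension_for_transformers.neural_chat import PipelineConfig, build_chatbot

config = PipelineConfig()        # model_name_or_path now defaults to "Intel/neural-chat-7b-v3-1"
chatbot = build_chatbot(config)  # no Hugging Face access token is needed for this model
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
print(response)
```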
@@ -45,8 +45,8 @@ pip install -r ../../../requirements.txt
## Install Models
```shell
git-lfs install
# download llama-2 model for NER plugin
git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
# download neural-chat-7b-v3-1 model for NER plugin
git clone https://huggingface.co/Intel/neural-chat-7b-v3-1
# download spacy model for NER post process
python -m spacy download en_core_web_lg
```
@@ -83,7 +83,7 @@ You can customize the configuration file `photoai.yaml` to match your environmen
| ------------------- | ---------------------------------------|
| host | 127.0.0.1 |
| port | 9000 |
| model_name_or_path | "./Llama-2-7b-chat-hf" |
| model_name_or_path | "./neural-chat-7b-v3-1" |
| device | "auto" |
| asr.enable | true |
| tts.enable | true |
@@ -46,7 +46,7 @@ pip install -r ../../../requirements.txt
## Download Models
```shell
git-lfs install
git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
git clone https://huggingface.co/Intel/neural-chat-7b-v3-1
```


@@ -58,7 +58,7 @@ You can customize the configuration file 'askdoc.yaml' to match your environment
| --------------------------------- | ---------------------------------------|
| host | 127.0.0.1 |
| port | 8000 |
| model_name_or_path | "./Llama-2-7b-chat-hf" |
| model_name_or_path | "./neural-chat-7b-v3-1" |
| device | "auto" |
| retrieval.enable | true |
| retrieval.args.input_path | "./docs" |
@@ -46,7 +46,7 @@ You can customize the configuration file 'textbot.yaml' to match your environmen
| --------------------- | --------------------------------------- |
| host | 127.0.0.1 |
| port | 8888 |
| model_name_or_path | "meta-llama/Llama-2-7b-chat-hf" |
| model_name_or_path | "Intel/neural-chat-7b-v3-1" |
| device | "cpu" |
| asr.enable | true |
| asr.args.device | "cpu" |
@@ -23,7 +23,7 @@
host: 0.0.0.0
port: 8888

model_name_or_path: "meta-llama/Llama-2-7b-chat-hf"
model_name_or_path: "Intel/neural-chat-7b-v3-1"
device: "cpu"

asr:
@@ -46,7 +46,7 @@ You can customize the configuration file 'textbot.yaml' to match your environmen
| ------------------- | --------------------------------------- |
| host | 127.0.0.1 |
| port | 8000 |
| model_name_or_path | "meta-llama/Llama-2-7b-chat-hf" |
| model_name_or_path | "Intel/neural-chat-7b-v3-1" |
| device | "cpu" |
| tasks_list | ['textchat'] |

@@ -23,7 +23,7 @@
host: 0.0.0.0
port: 8000

model_name_or_path: "meta-llama/Llama-2-7b-chat-hf"
model_name_or_path: "Intel/neural-chat-7b-v3-1"
device: "cpu"

# users can choose one of the ipex int8, itrex int4, mix precision and
@@ -40,13 +40,13 @@ pip install -r ../../../requirements.txt
You can customize the configuration file 'textbot.yaml' to match your environment setup. Here's a table to help you understand the configurable options:

| Item | Value |
| ------------------- | --------------------------------------- |
| host | 127.0.0.1 |
| port | 8000 |
| model_name_or_path | "meta-llama/Llama-2-7b-chat-hf" |
| device | "cpu" |
| cache.enable | true |
| tasks_list | ['textchat'] |
| ------------------- | ----------------------------------------- |
| host | 127.0.0.1 |
| port | 8000 |
| model_name_or_path | "Intel/neural-chat-7b-v3-1" |
| device | "cpu" |
| cache.enable | true |
| tasks_list | ['textchat'] |



@@ -742,7 +742,7 @@ def predict_stream(**params):
`num_beams` (int): Controls the number of beams used in beam search.
Higher values increase the diversity but also the computation time.
`model_name` (string): Specifies the name of the pre-trained model to use for text generation.
If not provided, the default model is "mosaicml/mpt-7b-chat".
If not provided, the default model is "Intel/neural-chat-7b-v3-1".
`num_return_sequences` (int): Specifies the number of alternative sequences to generate.
`bad_words_ids` (list or None): Contains a list of token IDs that should not appear in the generated text.
`force_words_ids` (list or None): Contains a list of token IDs that must be included in the generated text.
@@ -768,7 +768,7 @@ def predict_stream(**params):
do_sample = params["do_sample"] if "do_sample" in params else True
num_beams = int(params["num_beams"]) if "num_beams" in params else 0
model_name = (
params["model_name"] if "model_name" in params else "mosaicml/mpt-7b-chat"
params["model_name"] if "model_name" in params else "Intel/neural-chat-7b-v3-1"
)
num_return_sequences = (
params["num_return_sequences"] if "num_return_sequences" in params else 1
@@ -789,7 +789,9 @@ def predict_stream(**params):

if is_llm_runtime_model(model):
prompt = remove_prompt_history(model_name, prompt)
max_new_tokens = max_new_tokens if max_new_tokens > 1024 else 1024
max_new_tokens = max_new_tokens if (max_new_tokens > 1024 or \
"codellama" in model_name.lower() or \
"starcoder" in model_name.lower()) else 1024

streamer = TextIteratorStreamer(
tokenizer, skip_prompt=True, skip_special_tokens=True
@@ -1033,7 +1035,9 @@ def predict(**params):

if is_llm_runtime_model(model):
prompt = remove_prompt_history(model_name, prompt)
max_new_tokens = max_new_tokens if max_new_tokens > 1024 else 1024
max_new_tokens = max_new_tokens if (max_new_tokens > 1024 or \
"codellama" in model_name.lower() or \
"starcoder" in model_name.lower()) else 1024

if num_beams == 0:
num_beams = 1
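Both `predict_stream` and `predict` now apply the same rule: keep the caller's `max_new_tokens` for code-generation models (CodeLlama, StarCoder) and only enforce the 1024-token floor for other LLM-runtime models. A standalone restatement of that rule, as a hypothetical helper rather than code from this PR:

```python
def resolve_max_new_tokens(model_name: str, requested: int) -> int:
    """Mirror the PR's rule: code models keep the requested budget,
    other LLM-runtime models are raised to at least 1024 new tokens."""
    is_code_model = "codellama" in model_name.lower() or "starcoder" in model_name.lower()
    return requested if (requested > 1024 or is_code_model) else 1024


assert resolve_max_new_tokens("codellama/CodeLlama-7b-hf", 256) == 256    # respected for codegen
assert resolve_max_new_tokens("Intel/neural-chat-7b-v3-1", 256) == 1024   # floored for chat models
assert resolve_max_new_tokens("Intel/neural-chat-7b-v3-1", 2048) == 2048  # large requests pass through
```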
@@ -54,15 +54,15 @@ print("NER result: ", result)
## Plugin Parameters
You can customize the NER inference parameters to meet your personal needs and get better performance. Set a specific parameter via `plugins.ner.args["xxx"]`; a usage sketch follows the list. The available parameters are described below.
```python
model_name_or_path [str]: The Hugging Face model name or local path of the downloaded LLM model. Defaults to "./Llama-2-7b-chat-hf/".
model_name_or_path [str]: The Hugging Face model name or local path of the downloaded LLM model. Defaults to "./neural-chat-7b-v3-1/".

spacy_model [str]: The spaCy model for NLP processing; specify it according to the downloaded spaCy model. Defaults to "en_core_web_lg".

bf16 [bool]: Choose whether to use BF16 precision for NER inference. Defaults to False.
```
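A short sketch of the override pattern described above (it assumes the top-level `plugins` registry referenced in this README; attaching the configured plugins to a pipeline happens through the usual `PipelineConfig`/`build_chatbot` flow, which is outside this diff):

```python
# Hypothetical usage sketch following the plugins.ner.args["xxx"] pattern above.
from intel_extension_for_transformers.neural_chat import plugins

plugins.ner.enable = True
plugins.ner.args["model_name_or_path"] = "./neural-chat-7b-v3-1/"  # local clone of the new default model
plugins.ner.args["spacy_model"] = "en_core_web_lg"
plugins.ner.args["bf16"] = False
```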
For INT8 and INT4 models, the plugin parameters are slightly different. You can set a specific parameter via `plugins.ner_int.args["xxx"]`.
```python
model_name_or_path [str]: The Hugging Face model name or local path of the downloaded LLM model. Defaults to "./Llama-2-7b-chat-hf/".
model_name_or_path [str]: The Hugging Face model name or local path of the downloaded LLM model. Defaults to "./neural-chat-7b-v3-1/".

spacy_model [str]: The spaCy model for NLP processing; specify it according to the downloaded spaCy model. Defaults to "en_core_web_lg".

@@ -44,7 +44,7 @@ class NamedEntityRecognitionINT():
"""

def __init__(self,
model_path="meta-llama/Llama-2-7b-chat-hf",
model_path="Intel/neural-chat-7b-v3-1",
spacy_model="en_core_web_lg",
compute_dtype="fp32",
weight_dtype="int8",
6 changes: 3 additions & 3 deletions intel_extension_for_transformers/neural_chat/server/README.md
@@ -16,7 +16,7 @@ NeuralChat provides a default chatbot configuration in `./config/neuralchat.yaml
| ------------------------- | ------------------------ | --------------------------------------- | --------------------------------- |
| host | | 0.0.0.0 | Any valid IP address |
| port | | 8000 | Any valid port number |
| model_name_or_path | | "meta-llama/Llama-2-7b-chat-hf" | A valid model name or path |
| model_name_or_path | | "Intel/neural-chat-7b-v3-1" | A valid model name or path |
| tokenizer_name_or_path | | "" | A tokenizer name or path |
| peft_model_path | | "" | A peft model path |
| device | | "auto" | "cpu", "hpu", "xpu", "cuda" |
@@ -47,11 +47,11 @@ NeuralChat provides a default chatbot configuration in `./config/neuralchat.yaml
| | args.embedding_model_dir | "hkunlp/instructor-large" | A valid directory path |
| safety_checker | enable | false | true, false |
| ner | enable | false | true, false |
| | args.model_path | "meta-llama/Llama-2-7b-chat-hf" | A valid directory path of llm model |
| | args.model_path | "Intel/neural-chat-7b-v3-1" | A valid directory path of llm model |
| | args.spacy_model | "en_core_web_lg" | A valid name of downloaded spacy model |
| | args.bf16 | false | true, false |
| ner_int | enable | false | true, false |
| | args.model_path | "meta-llama/Llama-2-7b-chat-hf" | A valid directory path of llm model |
| | args.model_path | "Intel/neural-chat-7b-v3-1" | A valid directory path of llm model |
| | args.spacy_model | "en_core_web_lg" | A valid name of downloaded spacy model |
| | args.compute_dtype | "fp32" | "fp32", "int8" |
| | args.weight_dtype | "int8" | "int8", "int4" |
@@ -23,7 +23,7 @@
host: 0.0.0.0
port: 8000

model_name_or_path: "meta-llama/Llama-2-7b-chat-hf"
model_name_or_path: "Intel/neural-chat-7b-v3-1"
# tokenizer_name_or_path: ""
# peft_model_path: ""
device: "auto"
@@ -80,15 +80,15 @@ ner:
enable: false
args:
device: "cpu"
model_path: "meta-llama/Llama-2-7b-chat-hf"
model_path: "Intel/neural-chat-7b-v3-1"
spacy_model: "en_core_web_lg"
bf16: False

ner_int:
enable: false
args:
device: "cpu"
model_path: "meta-llama/Llama-2-7b-chat-hf"
model_path: "Intel/neural-chat-7b-v3-1"
spacy_model: "en_core_web_lg"
compute_dtype: "fp32"
weight_dtype: "int8"
@@ -100,22 +100,7 @@ def handle_chat_completion_request(self, request: ChatCompletionRequest):
def stream_generator():
nonlocal buffered_texts
for output in generator:
if isinstance(output, str):
chunks = output.split()
for chunk in chunks:
ret = {
"text": chunk,
"error_code": 0,
}
buffered_texts += chunk + ' '
yield json.dumps(ret).encode() + b"\0"
else:
ret = {
"text": output,
"error_code": 0,
}
buffered_texts += output + ' '
yield json.dumps(ret).encode() + b"\0"
yield f"data: {output}\n\n"
yield f"data: [DONE]\n\n"
if is_plugin_enabled("cache") and \
not plugins["cache"]["instance"].pre_llm_inference_actions(request.prompt):
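This hunk is the streaming fix itself: rather than splitting the output into whitespace-separated JSON chunks terminated with null bytes, the generator now emits Server-Sent-Events-style `data:` lines and a final `data: [DONE]` sentinel, so code tokens (including whitespace) survive intact. A client-side consumption sketch; the endpoint path and request body are assumptions, and only the `data:`/`[DONE]` framing comes from this diff:

```python
# Hypothetical client sketch using the requests library; only the "data: ..." /
# "data: [DONE]" framing reflects this PR, the URL and payload fields are illustrative.
import requests

payload = {"prompt": "def quick_sort(arr):", "stream": True}
with requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload, stream=True) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if not line:
            continue                       # skip blank separators between events
        chunk = line.removeprefix("data: ")
        if chunk == "[DONE]":
            break                          # end-of-stream sentinel from the server
        print(chunk, end="", flush=True)
```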
@@ -39,7 +39,7 @@ class RetrievalRequest(RequestBaseModel):


class FinetuneRequest(RequestBaseModel):
model_name_or_path: str = "meta-llama/Llama-2-7b-chat-hf"
model_name_or_path: str = "Intel/neural-chat-7b-v3-1"
train_file: str = None
dataset_name: str = None
output_dir: str = './tmp'
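The hunk above only changes the default checkpoint, but the visible fields are enough for a small illustration of the request model (a sketch assuming `RequestBaseModel` is a pydantic `BaseModel`, with `Optional` added where the original writes `str = None`; the remaining fields and the serving route are not shown in this diff):

```python
# Illustrative re-creation of the fields visible above; not the full class.
from typing import Optional

from pydantic import BaseModel


class FinetuneRequest(BaseModel):
    model_name_or_path: str = "Intel/neural-chat-7b-v3-1"
    train_file: Optional[str] = None
    dataset_name: Optional[str] = None
    output_dir: str = "./tmp"


req = FinetuneRequest(train_file="alpaca_data.json")
print(req.model_name_or_path)  # "Intel/neural-chat-7b-v3-1" unless a model is passed explicitly
```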
@@ -61,7 +61,7 @@ This will run the chatbot application in the background on your server.
Once the application is running, you can find the access URL in the trace log:

```log
INFO | gradio_web_server | Models: meta-llama/Llama-2-7b-chat-hf
INFO | gradio_web_server | Models: Intel/neural-chat-7b-v3-1
INFO | stdout | Running on local URL: http://0.0.0.0:7860
```
The URL to access the chatbot frontend is http://{SERVER_IP_ADDRESS}:80. Please remember to replace {SERVER_IP_ADDRESS} with your server's actual IP address.
@@ -61,7 +61,7 @@ This will run the chatbot application in the background on your server.
Once the application is running, you can find the access URL in the trace log:

```log
INFO | gradio_web_server | Models: meta-llama/Llama-2-7b-chat-hf
INFO | gradio_web_server | Models: Intel/neural-chat-7b-v3-1
INFO | stdout | Running on local URL: http://0.0.0.0:7860
```
Since there are two services, start two backends and generate URLs for both of them.