FEAT: support llama-3.3-instruct (#2661)

qinxuye authored Dec 13, 2024
1 parent 0ba44e5 commit d4f358f
Showing 9 changed files with 286 additions and 1 deletion.
2 changes: 2 additions & 0 deletions doc/source/models/builtin/embedding/index.rst
@@ -47,6 +47,8 @@ The following is a list of built-in embedding models in Xinference:

gte-qwen2

jina-clip-v2

jina-embeddings-v2-base-en

jina-embeddings-v2-base-zh
21 changes: 21 additions & 0 deletions doc/source/models/builtin/embedding/jina-clip-v2.rst
@@ -0,0 +1,21 @@
.. _models_builtin_jina-clip-v2:

============
jina-clip-v2
============

- **Model Name:** jina-clip-v2
- **Languages:** 89 languages supported
- **Abilities:** embed

Specifications
^^^^^^^^^^^^^^

- **Dimensions:** 1024
- **Max Tokens:** 8192
- **Model ID:** jinaai/jina-clip-v2
- **Model Hubs**: `Hugging Face <https://huggingface.co/jinaai/jina-clip-v2>`__, `ModelScope <https://modelscope.cn/models/jinaai/jina-clip-v2>`__

Execute the following command to launch the model::

xinference launch --model-name jina-clip-v2 --model-type embedding
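
Once the model is running, it can be queried from Python. Below is a minimal sketch using the Xinference client, assuming a local server at the default ``http://localhost:9997`` endpoint; adjust the URL for your deployment::

    from xinference.client import Client

    # Connect to a locally running Xinference server (assumed endpoint).
    client = Client("http://localhost:9997")

    # Launch the embedding model; the returned UID addresses it later.
    model_uid = client.launch_model(model_name="jina-clip-v2",
                                    model_type="embedding")

    # Embed a piece of text; the response follows the OpenAI embedding schema.
    model = client.get_model(model_uid)
    result = model.create_embedding("A photograph of a mountain lake at dawn.")
    print(len(result["data"][0]["embedding"]))  # expected: 1024, per the spec above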
7 changes: 7 additions & 0 deletions doc/source/models/builtin/llm/index.rst
@@ -261,6 +261,11 @@ The following is a list of built-in LLM in Xinference:
- 131072
- Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image...

* - :ref:`llama-3.3-instruct <models_llm_llama-3.3-instruct>`
- chat, tools
- 131072
- The Llama 3.3 instruction-tuned models are optimized for dialogue use cases and outperform many of the available open source chat models on common industry benchmarks.

* - :ref:`minicpm-2b-dpo-bf16 <models_llm_minicpm-2b-dpo-bf16>`
- chat
- 4096
@@ -664,6 +669,8 @@ The following is a list of built-in LLM in Xinference:

llama-3.2-vision-instruct

llama-3.3-instruct

minicpm-2b-dpo-bf16

minicpm-2b-dpo-fp16
95 changes: 95 additions & 0 deletions doc/source/models/builtin/llm/llama-3.3-instruct.rst
@@ -0,0 +1,95 @@
.. _models_llm_llama-3.3-instruct:

========================================
llama-3.3-instruct
========================================

- **Context Length:** 131072
- **Model Name:** llama-3.3-instruct
- **Languages:** en, de, fr, it, pt, hi, es, th
- **Abilities:** chat, tools
- **Description:** The Llama 3.3 instruction-tuned models are optimized for dialogue use cases and outperform many of the available open source chat models on common industry benchmarks.

Specifications
^^^^^^^^^^^^^^


Model Spec 1 (pytorch, 70 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** pytorch
- **Model Size (in billions):** 70
- **Quantizations:** none
- **Engines**: vLLM, Transformers, SGLang
- **Model ID:** meta-llama/Llama-3.3-70B-Instruct
- **Model Hubs**: `Hugging Face <https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct>`__, `ModelScope <https://modelscope.cn/models/LLM-Research/Llama-3.3-70B-Instruct>`__

Execute the following command to launch the model; remember to replace ``${engine}`` with an engine from the list above and ``${quantization}`` with your chosen quantization method from the options listed above::

xinference launch --model-engine ${engine} --model-name llama-3.3-instruct --size-in-billions 70 --model-format pytorch --quantization ${quantization}
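
As a concrete example, the following Python sketch launches this spec through the client and runs one chat turn. It assumes a recent Xinference client whose ``chat`` method accepts an OpenAI-style ``messages`` list, and substitutes ``vllm`` for ``${engine}`` and ``none`` for ``${quantization}`` (the only option for the pytorch format)::

    from xinference.client import Client

    client = Client("http://localhost:9997")  # assumed local endpoint

    # Mirrors the CLI launch above with ${engine}=vllm, ${quantization}=none.
    model_uid = client.launch_model(
        model_name="llama-3.3-instruct",
        model_engine="vllm",
        model_format="pytorch",
        model_size_in_billions=70,
        quantization="none",
    )

    model = client.get_model(model_uid)
    response = model.chat(
        messages=[{"role": "user", "content": "Summarize Llama 3.3 in one sentence."}],
        generate_config={"max_tokens": 128},
    )
    print(response["choices"][0]["message"]["content"])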


Model Spec 2 (gptq, 70 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** gptq
- **Model Size (in billions):** 70
- **Quantizations:** Int4
- **Engines**: vLLM, Transformers, SGLang
- **Model ID:** shuyuej/Llama-3.3-70B-Instruct-GPTQ
- **Model Hubs**: `Hugging Face <https://huggingface.co/shuyuej/Llama-3.3-70B-Instruct-GPTQ>`__

Execute the following command to launch the model; remember to replace ``${engine}`` with an engine from the list above and ``${quantization}`` with your chosen quantization method from the options listed above::

xinference launch --model-engine ${engine} --model-name llama-3.3-instruct --size-in-billions 70 --model-format gptq --quantization ${quantization}


Model Spec 3 (awq, 70 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** awq
- **Model Size (in billions):** 70
- **Quantizations:** Int4
- **Engines**: vLLM, Transformers, SGLang
- **Model ID:** casperhansen/llama-3.3-70b-instruct-awq
- **Model Hubs**: `Hugging Face <https://huggingface.co/casperhansen/llama-3.3-70b-instruct-awq>`__

Execute the following command to launch the model; remember to replace ``${engine}`` with an engine from the list above and ``${quantization}`` with your chosen quantization method from the options listed above::

xinference launch --model-engine ${engine} --model-name llama-3.3-instruct --size-in-billions 70 --model-format awq --quantization ${quantization}


Model Spec 4 (mlx, 70 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** mlx
- **Model Size (in billions):** 70
- **Quantizations:** 3bit, 4bit, 6bit, 8bit, fp16
- **Engines**: MLX
- **Model ID:** mlx-community/Llama-3.3-70B-Instruct-{quantization}
- **Model Hubs**: `Hugging Face <https://huggingface.co/mlx-community/Llama-3.3-70B-Instruct-{quantization}>`__

Execute the following command to launch the model; remember to replace ``${engine}`` with an engine from the list above and ``${quantization}`` with your chosen quantization method from the options listed above::

xinference launch --model-engine ${engine} --model-name llama-3.3-instruct --size-in-billions 70 --model-format mlx --quantization ${quantization}
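
For example, choosing the ``4bit`` quantization resolves the ``{quantization}`` placeholder above, so the weights are fetched from ``mlx-community/Llama-3.3-70B-Instruct-4bit``.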


Model Spec 5 (ggufv2, 70 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** ggufv2
- **Model Size (in billions):** 70
- **Quantizations:** Q3_K_L, Q4_K_M, Q6_K, Q8_0
- **Engines**: llama.cpp
- **Model ID:** lmstudio-community/Llama-3.3-70B-Instruct-GGUF
- **Model Hubs**: `Hugging Face <https://huggingface.co/lmstudio-community/Llama-3.3-70B-Instruct-GGUF>`__, `ModelScope <https://modelscope.cn/models/lmstudio-community/Llama-3.3-70B-Instruct-GGUF>`__

Execute the following command to launch the model; remember to replace ``${engine}`` with an engine from the list above and ``${quantization}`` with your chosen quantization method from the options listed above::

xinference launch --model-engine ${engine} --model-name llama-3.3-instruct --size-in-billions 70 --model-format ggufv2 --quantization ${quantization}
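
Because the model advertises the ``tools`` ability, a launched instance can also be driven through Xinference's OpenAI-compatible endpoint. The sketch below uses the ``openai`` package and a hypothetical ``get_weather`` function; it assumes the server runs on the default port and that the model was launched with ``--model-uid llama-3.3-instruct``::

    from openai import OpenAI

    # Xinference serves an OpenAI-compatible API under /v1 (assumed endpoint).
    client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-needed")

    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool, for illustration only
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    response = client.chat.completions.create(
        model="llama-3.3-instruct",  # the model UID chosen at launch
        messages=[{"role": "user", "content": "What's the weather in Paris?"}],
        tools=tools,
    )
    # Note: the chat template permits a single tool call per turn.
    print(response.choices[0].message.tool_calls)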

2 changes: 1 addition & 1 deletion doc/source/user_guide/backends.rst
@@ -46,7 +46,7 @@ Currently, supported model includes:

.. vllm_start
- - ``llama-2``, ``llama-3``, ``llama-3.1``, ``llama-3.2-vision``, ``llama-2-chat``, ``llama-3-instruct``, ``llama-3.1-instruct``
+ - ``llama-2``, ``llama-3``, ``llama-3.1``, ``llama-3.2-vision``, ``llama-2-chat``, ``llama-3-instruct``, ``llama-3.1-instruct``, ``llama-3.3-instruct``
- ``mistral-v0.1``, ``mistral-instruct-v0.1``, ``mistral-instruct-v0.2``, ``mistral-instruct-v0.3``, ``mistral-nemo-instruct``, ``mistral-large-instruct``
- ``codestral-v0.1``
- ``Yi``, ``Yi-1.5``, ``Yi-chat``, ``Yi-1.5-chat``, ``Yi-1.5-chat-16k``
Expand Down
92 changes: 92 additions & 0 deletions xinference/model/llm/llm_family.json
@@ -1399,6 +1399,98 @@
}
]
},
{
"version": 1,
"context_length": 131072,
"model_name": "llama-3.3-instruct",
"model_lang": [
"en",
"de",
"fr",
"it",
"pt",
"hi",
"es",
"th"
],
"model_ability": [
"chat",
"tools"
],
"model_description": "The Llama 3.3 instruction tuned models are optimized for dialogue use cases and outperform many of the available open source chat models on common industry benchmarks..",
"model_specs": [
{
"model_format": "pytorch",
"model_size_in_billions": 70,
"quantizations": [
"none"
],
"model_id": "meta-llama/Llama-3.3-70B-Instruct"
},
{
"model_format": "gptq",
"model_size_in_billions": 70,
"quantizations": [
"Int4"
],
"model_id": "shuyuej/Llama-3.3-70B-Instruct-GPTQ"
},
{
"model_format": "awq",
"model_size_in_billions": 70,
"quantizations": [
"Int4"
],
"model_id": "casperhansen/llama-3.3-70b-instruct-awq"
},
{
"model_format": "mlx",
"model_size_in_billions": 70,
"quantizations": [
"3bit",
"4bit",
"6bit",
"8bit",
"fp16"
],
"model_id": "mlx-community/Llama-3.3-70B-Instruct-{quantization}"
},
{
"model_format": "ggufv2",
"model_size_in_billions": 70,
"quantizations": [
"Q3_K_L",
"Q4_K_M",
"Q6_K",
"Q8_0"
],
"quantization_parts": {
"Q6_K": [
"00001-of-00002",
"00002-of-00002"
],
"Q8_0": [
"00001-of-00002",
"00002-of-00002"
]
},
"model_id": "lmstudio-community/Llama-3.3-70B-Instruct-GGUF",
"model_file_name_template": "Llama-3.3-70B-Instruct-{quantization}.gguf",
"model_file_name_split_template": "Llama-3.3-70B-Instruct-{quantization}-{part}.gguf"
}
],
"chat_template": "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- set date_string = \"26 Jul 2024\" %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message + builtin tools #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if builtin_tools is defined or tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{%- if builtin_tools is defined %}\n {{- \"Tools: \" + builtin_tools | reject('equalto', 'code_interpreter') | join(\", \") + \"\\n\\n\"}}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {%- if builtin_tools is defined and tool_call.name in builtin_tools %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- \"<|python_tag|>\" + tool_call.name + \".call(\" }}\n {%- for arg_name, arg_val in tool_call.arguments | items %}\n {{- arg_name + '=\"' + arg_val + '\"' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \")\" }}\n {%- else %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {%- endif %}\n {%- if builtin_tools is defined %}\n {#- This means we're in ipython mode #}\n {{- \"<|eom_id|>\" }}\n {%- else %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n",
"stop_token_ids": [
128001,
128008,
128009
],
"stop": [
"<|end_of_text|>",
"<|eot_id|>",
"<|eom_id|>"
]
},
{
"version": 1,
"context_length": 2048,
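
A note on the GGUF spec above: ``model_file_name_template`` names single-file quantizations, while ``quantization_parts`` plus ``model_file_name_split_template`` describe quantizations shipped as multiple shards. The sketch below is a simplified illustration of how those templates expand, not Xinference's actual download logic::

    # Hypothetical helper mirroring the spec fields shown above.
    def gguf_files(spec: dict, quantization: str) -> list[str]:
        parts = spec.get("quantization_parts", {}).get(quantization)
        if parts is None:
            # Single-file quantization, e.g. Q4_K_M.
            return [spec["model_file_name_template"].format(quantization=quantization)]
        # Sharded quantization: one file per part, e.g. Q8_0.
        return [
            spec["model_file_name_split_template"].format(
                quantization=quantization, part=part
            )
            for part in parts
        ]

    spec = {
        "quantization_parts": {
            "Q6_K": ["00001-of-00002", "00002-of-00002"],
            "Q8_0": ["00001-of-00002", "00002-of-00002"],
        },
        "model_file_name_template": "Llama-3.3-70B-Instruct-{quantization}.gguf",
        "model_file_name_split_template": "Llama-3.3-70B-Instruct-{quantization}-{part}.gguf",
    }

    print(gguf_files(spec, "Q4_K_M"))
    # ['Llama-3.3-70B-Instruct-Q4_K_M.gguf']
    print(gguf_files(spec, "Q8_0"))
    # ['Llama-3.3-70B-Instruct-Q8_0-00001-of-00002.gguf',
    #  'Llama-3.3-70B-Instruct-Q8_0-00002-of-00002.gguf']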
66 changes: 66 additions & 0 deletions xinference/model/llm/llm_family_modelscope.json
@@ -454,6 +454,72 @@
}
]
},
{
"version": 1,
"context_length": 131072,
"model_name": "llama-3.3-instruct",
"model_lang": [
"en",
"de",
"fr",
"it",
"pt",
"hi",
"es",
"th"
],
"model_ability": [
"chat",
"tools"
],
"model_description": "The Llama 3.3 instruction tuned models are optimized for dialogue use cases and outperform many of the available open source chat models on common industry benchmarks..",
"model_specs": [
{
"model_format": "pytorch",
"model_size_in_billions": 70,
"quantizations": [
"none"
],
"model_id": "LLM-Research/Llama-3.3-70B-Instruct",
"model_hub": "modelscope"
},
{
"model_format": "ggufv2",
"model_size_in_billions": 70,
"quantizations": [
"Q3_K_L",
"Q4_K_M",
"Q6_K",
"Q8_0"
],
"quantization_parts": {
"Q6_K": [
"00001-of-00002",
"00002-of-00002"
],
"Q8_0": [
"00001-of-00002",
"00002-of-00002"
]
},
"model_id": "lmstudio-community/Llama-3.3-70B-Instruct-GGUF",
"model_file_name_template": "Llama-3.3-70B-Instruct-{quantization}.gguf",
"model_file_name_split_template": "Llama-3.3-70B-Instruct-{quantization}-{part}.gguf",
"model_hub": "modelscope"
}
],
"chat_template": "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- set date_string = \"26 Jul 2024\" %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message + builtin tools #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if builtin_tools is defined or tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{%- if builtin_tools is defined %}\n {{- \"Tools: \" + builtin_tools | reject('equalto', 'code_interpreter') | join(\", \") + \"\\n\\n\"}}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {%- if builtin_tools is defined and tool_call.name in builtin_tools %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- \"<|python_tag|>\" + tool_call.name + \".call(\" }}\n {%- for arg_name, arg_val in tool_call.arguments | items %}\n {{- arg_name + '=\"' + arg_val + '\"' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \")\" }}\n {%- else %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {%- endif %}\n {%- if builtin_tools is defined %}\n {#- This means we're in ipython mode #}\n {{- \"<|eom_id|>\" }}\n {%- else %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n",
"stop_token_ids": [
128001,
128008,
128009
],
"stop": [
"<|end_of_text|>",
"<|eot_id|>",
"<|eom_id|>"
]
},
{
"version": 1,
"context_length": 2048,