FEAT: add codestral-v0.1 #1575

Merged: 3 commits, Jun 5, 2024
1 change: 1 addition & 0 deletions doc/source/getting_started/installation.rst
@@ -43,6 +43,7 @@ Currently, supported models include:
- ``baichuan``, ``baichuan-chat``, ``baichuan-2-chat``
- ``internlm-16k``, ``internlm-chat-7b``, ``internlm-chat-8k``, ``internlm-chat-20b``
- ``mistral-v0.1``, ``mistral-instruct-v0.1``, ``mistral-instruct-v0.2``, ``mistral-instruct-v0.3``
- ``codestral-v0.1``
- ``Yi``, ``Yi-1.5``, ``Yi-chat``, ``Yi-1.5-chat``, ``Yi-1.5-chat-16k``
- ``code-llama``, ``code-llama-python``, ``code-llama-instruct``
- ``deepseek``, ``deepseek-coder``, ``deepseek-chat``, ``deepseek-coder-instruct``
47 changes: 47 additions & 0 deletions doc/source/models/builtin/llm/codestral-v0.1.rst
@@ -0,0 +1,47 @@
.. _models_llm_codestral-v0.1:

========================================
codestral-v0.1
========================================

- **Context Length:** 32768
- **Model Name:** codestral-v0.1
- **Languages:** en
- **Abilities:** generate
- **Description:** Codestral-22B-v0.1 is trained on a diverse dataset of 80+ programming languages, including the most popular ones, such as Python, Java, C, C++, JavaScript, and Bash.

Specifications
^^^^^^^^^^^^^^


Model Spec 1 (pytorch, 22 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** pytorch
- **Model Size (in billions):** 22
- **Quantizations:** 4-bit, 8-bit, none
- **Engines**: vLLM, Transformers (vLLM is only available when the quantization is ``none``)
- **Model ID:** mistralai/Codestral-22B-v0.1
- **Model Hubs**: `Hugging Face <https://huggingface.co/mistralai/Codestral-22B-v0.1>`__

Execute the following command to launch the model. Remember to replace ``${engine}`` with your chosen engine and ``${quantization}`` with your chosen quantization method from the options listed above::

xinference launch --model-engine ${engine} --model-name codestral-v0.1 --size-in-billions 22 --model-format pytorch --quantization ${quantization}
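For readers scripting launches, the CLI flags above map one-to-one onto launch parameters. The helper below is a hypothetical illustration of assembling those parameters in Python (it only builds the argument set; actually launching requires a running xinference server, e.g. via the Python client's ``launch_model``):

```python
# Hypothetical sketch: collect the launch arguments corresponding to the
# CLI flags above. This does not contact a server; it only shows how the
# flags map onto named parameters.
def launch_params(engine: str, quantization: str) -> dict:
    return {
        "model_engine": engine,              # e.g. "transformers" or "vllm"
        "model_name": "codestral-v0.1",
        "model_size_in_billions": 22,
        "model_format": "pytorch",
        "quantization": quantization,        # "4-bit", "8-bit", or "none"
    }

params = launch_params("transformers", "none")
print(params["model_name"])  # codestral-v0.1
```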


Model Spec 2 (ggufv2, 22 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** ggufv2
- **Model Size (in billions):** 22
- **Quantizations:** Q2_K, Q3_K_S, Q3_K_M, Q3_K_L, Q4_K_S, Q4_K_M, Q5_K_S, Q5_K_M, Q6_K, Q8_0
- **Engines**: llama.cpp
- **Model ID:** bartowski/Codestral-22B-v0.1-GGUF
- **Model Hubs**: `Hugging Face <https://huggingface.co/bartowski/Codestral-22B-v0.1-GGUF>`__

Execute the following command to launch the model. Remember to replace ``${engine}`` with your chosen engine and ``${quantization}`` with your chosen quantization method from the options listed above::

xinference launch --model-engine ${engine} --model-name codestral-v0.1 --size-in-billions 22 --model-format ggufv2 --quantization ${quantization}

14 changes: 14 additions & 0 deletions doc/source/models/builtin/llm/index.rst
@@ -126,6 +126,11 @@ The following is a list of built-in LLM in Xinference:
- 8194
- CodeShell is a multi-language code LLM developed by the Knowledge Computing Lab of Peking University.

* - :ref:`codestral-v0.1 <models_llm_codestral-v0.1>`
- generate
- 32768
- Codestral-22B-v0.1 is trained on a diverse dataset of 80+ programming languages, including the most popular ones, such as Python, Java, C, C++, JavaScript, and Bash.

* - :ref:`cogvlm2 <models_llm_cogvlm2>`
- chat, vision
- 8192
@@ -276,6 +281,11 @@ The following is a list of built-in LLM in Xinference:
- 4096
- MiniCPM is an End-Size LLM developed by ModelBest Inc. and TsinghuaNLP, with only 2.4B parameters excluding embeddings.

* - :ref:`minicpm-llama3-v-2_5 <models_llm_minicpm-llama3-v-2_5>`
- chat, vision
- 2048
- MiniCPM-Llama3-V 2.5 is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters.

* - :ref:`mistral-instruct-v0.1 <models_llm_mistral-instruct-v0.1>`
- chat
- 8192
@@ -570,6 +580,8 @@ The following is a list of built-in LLM in Xinference:

codeshell-chat

codestral-v0.1

cogvlm2

deepseek
@@ -630,6 +642,8 @@ The following is a list of built-in LLM in Xinference:

minicpm-2b-sft-fp32

minicpm-llama3-v-2_5

mistral-instruct-v0.1

mistral-instruct-v0.2
9 changes: 6 additions & 3 deletions doc/source/models/builtin/llm/internvl-chat.rst
@@ -20,13 +20,14 @@ Model Spec 1 (pytorch, 2 Billion)
- **Model Format:** pytorch
- **Model Size (in billions):** 2
- **Quantizations:** none
- **Engines**: Transformers
- **Model ID:** OpenGVLab/Mini-InternVL-Chat-2B-V1-5
- **Model Hubs**: `Hugging Face <https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-2B-V1-5>`__

Execute the following command to launch the model. Remember to replace ``${quantization}`` with your chosen quantization method from the options listed above::

xinference launch --model-name internvl-chat --size-in-billions 2 --model-format pytorch --quantization ${quantization}
xinference launch --model-engine ${engine} --model-name internvl-chat --size-in-billions 2 --model-format pytorch --quantization ${quantization}


Model Spec 2 (pytorch, 26 Billion)
@@ -35,13 +36,14 @@ Model Spec 2 (pytorch, 26 Billion)
- **Model Format:** pytorch
- **Model Size (in billions):** 26
- **Quantizations:** none
- **Engines**: Transformers
- **Model ID:** OpenGVLab/InternVL-Chat-V1-5
- **Model Hubs**: `Hugging Face <https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5>`__, `ModelScope <https://modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5-{quantization}>`__

Execute the following command to launch the model. Remember to replace ``${quantization}`` with your chosen quantization method from the options listed above::

xinference launch --model-name internvl-chat --size-in-billions 26 --model-format pytorch --quantization ${quantization}
xinference launch --model-engine ${engine} --model-name internvl-chat --size-in-billions 26 --model-format pytorch --quantization ${quantization}


Model Spec 3 (pytorch, 26 Billion)
@@ -50,11 +52,12 @@ Model Spec 3 (pytorch, 26 Billion)
- **Model Format:** pytorch
- **Model Size (in billions):** 26
- **Quantizations:** Int8
- **Engines**: Transformers
- **Model ID:** OpenGVLab/InternVL-Chat-V1-5-{quantization}
- **Model Hubs**: `Hugging Face <https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5-{quantization}>`__, `ModelScope <https://modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5-{quantization}>`__

Execute the following command to launch the model. Remember to replace ``${quantization}`` with your chosen quantization method from the options listed above::

xinference launch --model-name internvl-chat --size-in-billions 26 --model-format pytorch --quantization ${quantization}
xinference launch --model-engine ${engine} --model-name internvl-chat --size-in-billions 26 --model-format pytorch --quantization ${quantization}

6 changes: 3 additions & 3 deletions doc/source/models/builtin/llm/telechat.rst
@@ -54,7 +54,7 @@ Model Spec 3 (pytorch, 12 Billion)
- **Quantizations:** 4-bit, 8-bit, none
- **Engines**: Transformers
- **Model ID:** Tele-AI/TeleChat-12B
- **Model Hubs**: `Hugging Face <https://huggingface.co/Tele-AI/TeleChat-12B>`__, `ModelScope <https://modelscope.cn/models/Tele-AI/TeleChat-12B>`__
- **Model Hubs**: `Hugging Face <https://huggingface.co/Tele-AI/TeleChat-12B>`__, `ModelScope <https://modelscope.cn/models/TeleAI/TeleChat-12B>`__

Execute the following command to launch the model. Remember to replace ``${quantization}`` with your chosen quantization method from the options listed above::
@@ -70,7 +70,7 @@ Model Spec 4 (gptq, 12 Billion)
- **Quantizations:** int4, int8
- **Engines**: Transformers
- **Model ID:** Tele-AI/TeleChat-12B-{quantization}
- **Model Hubs**: `Hugging Face <https://huggingface.co/Tele-AI/TeleChat-12B-{quantization}>`__, `ModelScope <https://modelscope.cn/models/Tele-AI/TeleChat-12B-{quantization}>`__
- **Model Hubs**: `Hugging Face <https://huggingface.co/Tele-AI/TeleChat-12B-{quantization}>`__, `ModelScope <https://modelscope.cn/models/TeleAI/TeleChat-12B-{quantization}>`__

Execute the following command to launch the model. Remember to replace ``${quantization}`` with your chosen quantization method from the options listed above::
@@ -86,7 +86,7 @@ Model Spec 5 (pytorch, 52 Billion)
- **Quantizations:** 4-bit, 8-bit, none
- **Engines**: Transformers
- **Model ID:** Tele-AI/TeleChat-52B
- **Model Hubs**: `Hugging Face <https://huggingface.co/Tele-AI/TeleChat-52B>`__, `ModelScope <https://modelscope.cn/models/Tele-AI/TeleChat-52B>`__
- **Model Hubs**: `Hugging Face <https://huggingface.co/Tele-AI/TeleChat-52B>`__, `ModelScope <https://modelscope.cn/models/TeleAI/TeleChat-52B>`__

Execute the following command to launch the model. Remember to replace ``${quantization}`` with your chosen quantization method from the options listed above::
1 change: 1 addition & 0 deletions doc/source/user_guide/backends.rst
@@ -50,6 +50,7 @@ Currently, supported models include:
- ``baichuan``, ``baichuan-chat``, ``baichuan-2-chat``
- ``internlm-16k``, ``internlm-chat-7b``, ``internlm-chat-8k``, ``internlm-chat-20b``
- ``mistral-v0.1``, ``mistral-instruct-v0.1``, ``mistral-instruct-v0.2``, ``mistral-instruct-v0.3``
- ``codestral-v0.1``
- ``Yi``, ``Yi-1.5``, ``Yi-chat``, ``Yi-1.5-chat``, ``Yi-1.5-chat-16k``
- ``code-llama``, ``code-llama-python``, ``code-llama-instruct``
- ``deepseek``, ``deepseek-coder``, ``deepseek-chat``, ``deepseek-coder-instruct``
43 changes: 43 additions & 0 deletions xinference/model/llm/llm_family.json
@@ -3417,6 +3417,49 @@
]
}
},
{
"version": 1,
"context_length": 32768,
"model_name": "codestral-v0.1",
"model_lang": [
"en"
],
"model_ability": [
"generate"
],
"model_description": "Codestral-22B-v0.1 is trained on a diverse dataset of 80+ programming languages, including the most popular ones, such as Python, Java, C, C++, JavaScript, and Bash",
"model_specs": [
{
"model_format": "pytorch",
"model_size_in_billions": 22,
"quantizations": [
"4-bit",
"8-bit",
"none"
],
"model_id": "mistralai/Codestral-22B-v0.1"
},
{
"model_format": "ggufv2",
"model_size_in_billions": 22,
"quantizations": [
"Q2_K",
"Q3_K_S",
"Q3_K_M",
"Q3_K_L",
"Q4_K_S",
"Q4_K_M",
"Q5_K_S",
"Q5_K_M",
"Q6_K",
"Q8_0"
],
"model_id": "bartowski/Codestral-22B-v0.1-GGUF",
"model_file_name_template": "Codestral-22B-v0.1-{quantization}.gguf"
}
]
},
{
"version": 1,
"context_length": 8192,
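The ``model_file_name_template`` in the GGUF spec above is a plain Python format string: the chosen quantization is substituted to produce the concrete file name to download. A minimal illustrative sketch (not xinference's internal resolution code):

```python
# Resolve a concrete GGUF file name from the spec's template and the
# chosen quantization, as the JSON above implies (illustrative sketch).
TEMPLATE = "Codestral-22B-v0.1-{quantization}.gguf"
QUANTIZATIONS = [
    "Q2_K", "Q3_K_S", "Q3_K_M", "Q3_K_L", "Q4_K_S",
    "Q4_K_M", "Q5_K_S", "Q5_K_M", "Q6_K", "Q8_0",
]

def resolve_file_name(quantization: str) -> str:
    # Guard against quantizations the spec does not list.
    if quantization not in QUANTIZATIONS:
        raise ValueError(f"unsupported quantization: {quantization}")
    return TEMPLATE.format(quantization=quantization)

print(resolve_file_name("Q4_K_M"))  # Codestral-22B-v0.1-Q4_K_M.gguf
```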
2 changes: 1 addition & 1 deletion xinference/model/llm/pytorch/chatglm.py
@@ -83,7 +83,7 @@ def match(
if llm_spec.model_format != "pytorch":
return False
model_family = llm_family.model_family or llm_family.model_name
if "chatglm" not in model_family or "glm4" not in model_family:
if "chatglm" not in model_family and "glm4" not in model_family:
return False
if "chat" not in llm_family.model_ability:
return False
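The ``or`` to ``and`` change above is a real logic fix: with ``or``, any family lacking at least one of the two substrings was rejected, so even valid chatglm/glm4 families failed the check. A small sketch of the corrected predicate (hypothetical helper name, not the library's code):

```python
# The corrected condition rejects a family only when it matches neither
# "chatglm" nor "glm4" (De Morgan: not-A and not-B == not (A or B)).
def is_glm_family(model_family: str) -> bool:
    return "chatglm" in model_family or "glm4" in model_family

print(is_glm_family("chatglm3"))   # True
print(is_glm_family("glm4-chat"))  # True
print(is_glm_family("llama-2"))    # False
```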
5 changes: 5 additions & 0 deletions xinference/model/llm/pytorch/core.py
@@ -53,6 +53,11 @@
"chatglm2",
"chatglm2-32k",
"chatglm2-128k",
"chatglm3",
"chatglm3-32k",
"chatglm3-128k",
"glm4-chat",
"glm4-chat-1m",
"llama-2",
"llama-2-chat",
"internlm2-chat",
1 change: 1 addition & 0 deletions xinference/model/llm/vllm/core.py
@@ -93,6 +93,7 @@ class VLLMGenerateConfig(TypedDict, total=False):
"baichuan",
"internlm-16k",
"mistral-v0.1",
"codestral-v0.1",
"Yi",
"Yi-1.5",
"code-llama",