added chat model support for yi
guocuimi committed Nov 23, 2023
1 parent 83cf084 commit 68854da
Showing 9 changed files with 116 additions and 47 deletions.
63 changes: 38 additions & 25 deletions README.md
@@ -19,12 +19,14 @@ In the coming weeks, we have exciting plans to focus on [**_speculative decoding
## Table of contents

- [Overview](#overview)
- [Supported Models](#supported-models)
- [Get Started](#get-started)
- [Docker Container](#docker-container)
- [ScaleLLM server](#scalellm-server)
- [Rest API Server](#rest-api-server)
- [Chatbot UI](#chatbot-ui)
- [Docker Compose](#docker-compose)
- [Usage Examples](#usage-examples)
- [Supported Models](#supported-models)
- [Quatization](#quatization)
- [Quantization](#quantization)
- [Limitations](#limitations)
- [Contributing](#Contributing)
- [Acknowledgements](#acknowledgements)
@@ -44,6 +46,33 @@ ScaleLLM is a cutting-edge inference system engineered for large language models
- [Customizable](): Offers flexibility for customization to meet your specific needs, and provides an easy way to add new models.
- [Production Ready](): Engineered with production environments in mind, ScaleLLM is equipped with robust system monitoring and management features to ensure a seamless deployment experience.


## Supported Models

Please note that to use Yi models, you need to add `--model_type=Yi` to the command line. For example:
```bash
docker run -it --gpus=all --net=host --shm-size=1g \
-v $HOME/.cache/huggingface/hub:/models \
-e HF_MODEL_ID=01-ai/Yi-34B-Chat-4bits \
-e DEVICE=auto \
docker.io/vectorchai/scalellm:latest --logtostderr --model_type=Yi
```

| Models | Tensor Parallel | Quantization | Chat API | HF models examples |
| :--------: | :-------------: | :----------: | :------: | :---------------------------:|
| Yi | Yes | Yes | Yes |[01-ai/Yi-6B](https://huggingface.co/01-ai/Yi-6B), [01-ai/Yi-34B-Chat-4bits](https://huggingface.co/01-ai/Yi-34B-Chat-4bits), [01-ai/Yi-6B-200K](https://huggingface.co/01-ai/Yi-6B-200K), [casperhansen/yi-6b-awq](https://huggingface.co/casperhansen/yi-6b-awq), [TheBloke/Yi-34B-GPTQ](https://huggingface.co/TheBloke/Yi-34B-GPTQ) |
| Llama2 | Yes | Yes | Yes | [meta-llama/Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b), [TheBloke/Llama-2-13B-chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ), [TheBloke/Llama-2-70B-AWQ](https://huggingface.co/TheBloke/Llama-2-70B-AWQ) |
| Aquila | Yes | Yes | Yes | [BAAI/Aquila-7B](https://huggingface.co/BAAI/Aquila-7B), [BAAI/AquilaChat-7B](https://huggingface.co/BAAI/AquilaChat-7B) |
| Bloom | Yes | Yes | No | [bigscience/bloom](https://huggingface.co/bigscience/bloom) |
| GPT_j | Yes | Yes | No | [EleutherAI/gpt-j-6b](https://huggingface.co/EleutherAI/gpt-j-6b) |
| GPT_NeoX | Yes | Yes | No | [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) |
| GPT2 | Yes | Yes | No | [gpt2](https://huggingface.co/gpt2)|
| InternLM | Yes | Yes | Yes | [internlm/internlm-7b](https://huggingface.co/internlm/internlm-7b) |
| Mistral | Yes | Yes | Yes | [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) |
| MPT | Yes | Yes | No | [mosaicml/mpt-30b](https://huggingface.co/mosaicml/mpt-30b) |

If your model is not included in the supported list, we are more than willing to assist you. Please feel free to create a request for adding a new model on [GitHub Issues](https://github.com/vectorch-ai/ScaleLLM/issues).

## Getting Started

The easiest way to get started with our project is by using the official Docker images. If you don't have Docker installed, please follow the installation instructions for your platform.
@@ -55,9 +84,9 @@ You can download and install Docker from the official website: [Docker Installat
> **Note**<br />
> To use GPUs, you also need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).
### Docker Container
### ScaleLLM server

Once you have Docker installed, you can run our project's Docker container using the following command:
Once you have Docker installed, you can run the ScaleLLM Docker container with the following command:

```bash
docker run -it --gpus=all --net=host --shm-size=1g \
@@ -80,7 +109,7 @@ This command starts the Docker container with GPU support and various configurat
> **Note**<br />
> Although ScaleLLM supports both `CPU` and `GPU`, we recommend using GPU for better performance. CPU support is mainly for debugging and testing purposes, so the performance might be sub-optimal. If you want to use CPU, please set `DEVICE=cpu` in the command.
### Ports and Endpoints
#### Ports and Endpoints

After running the Docker container, two ports are exposed:

@@ -108,7 +137,7 @@ docker run -it --net=host \

The REST API Server is available on `localhost:8080`. You can use REST API requests to interact with the system. Check out the [Usage Examples](#usage-examples) section for more details.

### Local Chatbot UI
### Chatbot UI

A local Chatbot UI is also available on [localhost:3000](localhost:3000). You can start it with the following command:

@@ -119,7 +148,7 @@ docker run -it --net=host \
docker.io/vectorchai/chatbot-ui:latest
```

## Docker Compose
### Docker Compose

Using Docker Compose is the easiest way to run ScaleLLM with all the services together. If you don't have Docker Compose installed, please follow the [installation doc](https://docs.docker.com/compose/install/) for your platform.

@@ -231,23 +260,6 @@ for chunk in completion:
print(content, end="")
```

## Supported Models

| Models | Tensor Parallel | Quantization | Chat API | HF models examples |
| :--------: | :-------------: | :----------: | :------: | :---------------------------:|
| Yi | Yes | Yes | No |[01-ai/Yi-6B](https://huggingface.co/01-ai/Yi-6B), [01-ai/Yi-6B-200K](https://huggingface.co/01-ai/Yi-6B-200K), [casperhansen/yi-6b-awq](https://huggingface.co/casperhansen/yi-6b-awq), [TheBloke/Yi-34B-GPTQ](https://huggingface.co/TheBloke/Yi-34B-GPTQ) |
| Llama2 | Yes | Yes | Yes | [meta-llama/Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b), [TheBloke/Llama-2-13B-chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ), [TheBloke/Llama-2-70B-AWQ](https://huggingface.co/TheBloke/Llama-2-70B-AWQ) |
| Aquila | Yes | Yes | Yes | [BAAI/Aquila-7B](https://huggingface.co/BAAI/Aquila-7B), [BAAI/AquilaChat-7B](https://huggingface.co/BAAI/AquilaChat-7B) |
| Bloom | Yes | Yes | No | [bigscience/bloom](https://huggingface.co/bigscience/bloom) |
| GPT_j | Yes | Yes | No | [EleutherAI/gpt-j-6b](https://huggingface.co/EleutherAI/gpt-j-6b) |
| GPT_NeoX | Yes | Yes | No | [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) |
| GPT2 | Yes | Yes | No | [gpt2](https://huggingface.co/gpt2)|
| InternLM | Yes | Yes | Yes | [internlm/internlm-7b](https://huggingface.co/internlm/internlm-7b) |
| Mistral | Yes | Yes | Yes | [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) |
| MPT | Yes | Yes | No | [mosaicml/mpt-30b](https://huggingface.co/mosaicml/mpt-30b) |

If your model is not included in the supported list, we are more than willing to assist you. Please feel free to create a request for adding a new model on [GitHub Issues](https://github.com/vectorch-ai/ScaleLLM/issues).

## Quantization
Quantization is a crucial process for reducing the memory footprint of models. ScaleLLM offers support for two quantization techniques: Accurate Post-Training Quantization ([GPTQ](https://arxiv.org/abs/2210.17323)) and Activation-aware Weight Quantization ([AWQ](https://arxiv.org/abs/2306.00978)), with seamless integration into the following libraries: autogptq, exllama, exllamav2, and awq.

@@ -258,6 +270,7 @@ Quantization is a crucial process for reducing the memory footprint of models. S
There are several known limitations we are looking to address in the coming months, including:

- Only supports Hugging Face models with [fast tokenizers](https://github.com/huggingface/tokenizers).
- Only supports GPUs newer than the Turing architecture.

## Contributing

11 changes: 8 additions & 3 deletions src/model_loader/model_loader.cpp
@@ -180,17 +180,22 @@ bool HFModelLoader::load_model_args(const std::string& model_weights_path) {
return false;
}

std::string model_type;
if (auto data = reader.value<std::string>("model_type")) {
args_.model_type() = data.value();
model_type = data.value();
} else {
GLOG(ERROR) << "Failed to find model_type in " << args_file_path;
return false;
}

auto args_loader = ModelRegistry::get_model_args_loader(args_.model_type());
// override model type from gflag if exists
if (!FLAGS_model_type.empty()) {
model_type = FLAGS_model_type;
}
auto args_loader = ModelRegistry::get_model_args_loader(model_type);
if (args_loader == nullptr) {
GLOG(ERROR) << "Failed to find model args loader for model type "
<< args_.model_type();
<< model_type;
return false;
}
args_loader(reader, &args_);
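For context: the `FLAGS_model_type` used in the override above is a gflags string flag (the new comment calls it a "gflag") declared elsewhere in the tree. A minimal sketch of that pattern — the declaration site, help text, and helper function here are assumptions for illustration, not the actual ScaleLLM code:

```cpp
#include <gflags/gflags.h>

#include <string>

// Hypothetical declaration site for the flag consumed above.
DEFINE_string(model_type,
              "",
              "Override the model_type read from the checkpoint's config.json, "
              "e.g. --model_type=Yi");

// Mirrors the override logic added in HFModelLoader::load_model_args:
// an empty flag means "trust config.json"; otherwise the flag wins.
std::string resolve_model_type(const std::string& type_from_config) {
  return FLAGS_model_type.empty() ? type_from_config : FLAGS_model_type;
}
```

With this in place, `docker run ... --model_type=Yi` (as shown in the README above) routes loading through the Yi registrations even when the checkpoint's config.json reports a different `model_type`.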
4 changes: 4 additions & 0 deletions src/models/args.h
@@ -1,6 +1,7 @@
#pragma once

#include <optional>
#include <unordered_set>

#include "common/arg.h"
#include "common/process_group.h"
@@ -83,6 +84,9 @@ struct ModelArgs {

// whether to apply residual connection post layernorm
DEFINE_ARG(bool, residual_post_layernorm) = false;

// Stop token ids
DEFINE_ARG(std::unordered_set<int32_t>, stop_token_ids);
};

inline std::ostream& operator<<(std::ostream& os, const ModelArgs& args) {
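The code generated by `DEFINE_ARG` is not shown in this diff; assuming it produces a mutable-reference accessor (which matches how `args_.model_type()` is read and written in model_loader.cpp above), a small self-contained sketch of using the new field looks like this:

```cpp
// Sketch only: a stand-in for the accessor DEFINE_ARG is assumed to generate.
#include <cstdint>
#include <unordered_set>

struct ModelArgsSketch {
  // Getter and setter in one: returns a mutable reference.
  std::unordered_set<int32_t>& stop_token_ids() { return stop_token_ids_; }
  const std::unordered_set<int32_t>& stop_token_ids() const {
    return stop_token_ids_;
  }

 private:
  std::unordered_set<int32_t> stop_token_ids_;
};

int main() {
  ModelArgsSketch args;
  // The Yi registration below sets {2, 6, 7, 8} for
  // <|endoftext|>, <|im_start|>, <|im_end|>, <|im_sep|>.
  args.stop_token_ids() = {2, 6, 7, 8};
  return args.stop_token_ids().count(7) > 0 ? 0 : 1;
}
```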
1 change: 0 additions & 1 deletion src/models/huggingface/llama.h
@@ -141,7 +141,6 @@ class LlamaAttentionImpl : public torch::nn::Module {
torch::Tensor positions,
KVCache& kv_cache,
const InputParameters& input_params) {
const auto num_tokens = x.size(0);
// (num_tokens, dim) x (dim, n_local_heads * head_dim)
// => (num_tokens, n_local_heads * head_dim)
auto qkv = qkv_proj_(x).split(/*split_size=*/qkv_sizes_, /*dim=*/-1);
62 changes: 50 additions & 12 deletions src/models/huggingface/yi.h
@@ -2,6 +2,8 @@

#include <torch/torch.h>

#include <unordered_set>

#include "layers/activation.h"
#include "layers/attention_rope.h"
#include "layers/embedding.h"
@@ -190,34 +192,39 @@ class YiDecoderLayerImpl : public torch::nn::Module {
YiAttention(args, quant_args, parallel_args, dtype, device));
mlp_ = register_module(
"mlp", YiMLP(args, quant_args, parallel_args, dtype, device));
ln1_ = register_module(
"ln1", RMSNorm(args.hidden_size(), args.rms_norm_eps(), dtype, device));
ln2_ = register_module(
"ln2", RMSNorm(args.hidden_size(), args.rms_norm_eps(), dtype, device));
input_layernorm_ = register_module(
"input_layernorm",
RMSNorm(args.hidden_size(), args.rms_norm_eps(), dtype, device));
post_attention_layernorm_ = register_module(
"post_attention_layernorm",
RMSNorm(args.hidden_size(), args.rms_norm_eps(), dtype, device));
}

torch::Tensor forward(torch::Tensor x,
torch::Tensor positions,
KVCache& kv_cache,
const InputParameters& input_params) {
auto h = x + self_attn_(ln1_(x), positions, kv_cache, input_params);
return h + mlp_(ln2_(h));
auto h =
x + self_attn_(input_layernorm_(x), positions, kv_cache, input_params);
return h + mlp_(post_attention_layernorm_(h));
}

// load the weight from the checkpoint
void load_state_dict(const StateDict& state_dict) {
// call each submodule's load_state_dict function
self_attn_->load_state_dict(state_dict.select("self_attn."));
mlp_->load_state_dict(state_dict.select("mlp."));
ln1_->load_state_dict(state_dict.select("ln1."));
ln2_->load_state_dict(state_dict.select("ln2."));
input_layernorm_->load_state_dict(state_dict.select("input_layernorm."));
post_attention_layernorm_->load_state_dict(
state_dict.select("post_attention_layernorm."));
}

void verify_loaded_weights(const std::string& prefix) const {
self_attn_->verify_loaded_weights(prefix + "self_attn.");
mlp_->verify_loaded_weights(prefix + "mlp.");
ln1_->verify_loaded_weights(prefix + "ln1.");
ln2_->verify_loaded_weights(prefix + "ln2.");
input_layernorm_->verify_loaded_weights(prefix + "input_layernorm.");
post_attention_layernorm_->verify_loaded_weights(
prefix + "post_attention_layernorm.");
}

private:
@@ -226,9 +233,9 @@ class YiDecoderLayerImpl : public torch::nn::Module {

YiMLP mlp_{nullptr};

RMSNorm ln1_{nullptr};
RMSNorm input_layernorm_{nullptr};

RMSNorm ln2_{nullptr};
RMSNorm post_attention_layernorm_{nullptr};
};
TORCH_MODULE(YiDecoderLayer);

@@ -357,8 +364,36 @@ class YiForCausalLMImpl : public torch::nn::Module {
};
TORCH_MODULE(YiForCausalLM);

class YiDialog final : public Dialog {
public:
// generate prompt from dialogs
// https://huggingface.co/01-ai/Yi-34B-Chat/blob/main/tokenizer_config.json#L60
// Prompt template:
// <|im_start|>user\n {message} <|im_end|>\n
// <|im_start|>assistant\n
std::optional<std::string> get_prompt() const override {
// at least one user message
if (messages_.size() % 2 == 0) {
return std::nullopt;
}

std::stringstream ss;
// It seems the Yi chat template doesn't include a system message.

// then user and assistant message pairs (u/a/u/a/u...)
for (size_t i = 0; i < messages_.size(); ++i) {
const char* role = (i % 2) == 0 ? "user" : "assistant";
ss << "<|im_start|>" << role << "\n" << messages_[i] << "<|im_end|>\n";
}
// end with assistant message
ss << "<|im_start|>assistant\n";
return ss.str();
}
};

// register the causal model
REGISTER_CAUSAL_MODEL(Yi, YiForCausalLM);
REGISTER_DIALOG(Yi, YiDialog);
// register the model args
// example config:
// https://huggingface.co/01-ai/Yi-6B/blob/main/config.json
@@ -377,6 +412,9 @@ REGISTER_MODEL_ARGS(Yi, [&] {
LOAD_ARG_OR(eos_token_id, "eos_token_id", 2);
LOAD_ARG_OR(rope_theta, "rope_theta", 5000000.0f);
LOAD_ARG_OR(rope_scaling, "rope_scaling", 1.0f);

// stop token ids: "<|endoftext|>", "<|im_start|>", "<|im_end|>", "<|im_sep|>"
LOAD_ARG_OR(stop_token_ids, "", std::unordered_set<int32_t>({2, 6, 7, 8}));
});

} // namespace llm::hf
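To make the ChatML-style template concrete, here is a small standalone sketch that reproduces the loop in `YiDialog::get_prompt` for an invented three-message conversation and prints the resulting prompt:

```cpp
// Standalone illustration of the prompt built by YiDialog::get_prompt above.
// The message contents are made up; the template mirrors the code in yi.h.
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main() {
  // u/a/u ordering: an odd-sized list that ends with a user message.
  std::vector<std::string> messages = {
      "What is ScaleLLM?",
      "ScaleLLM is an inference system for large language models.",
      "Which Yi models does it support?"};

  std::stringstream ss;
  for (size_t i = 0; i < messages.size(); ++i) {
    const char* role = (i % 2) == 0 ? "user" : "assistant";
    ss << "<|im_start|>" << role << "\n" << messages[i] << "<|im_end|>\n";
  }
  // End with an open assistant turn for the model to complete.
  ss << "<|im_start|>assistant\n";
  std::cout << ss.str();
  return 0;
}
```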
9 changes: 7 additions & 2 deletions src/request/sequence.cpp
@@ -46,12 +46,17 @@ bool Sequence::check_stopping_creteria() {
return is_finished_ = true;
}

const auto last_token_id = token_ids_.back();
if (!stopping_criteria_->ignore_eos_token &&
token_ids_.back() == stopping_criteria_->eos_token_id) {
last_token_id == stopping_criteria_->eos_token_id) {
finish_reason_ = FinishReason::STOP;
return is_finished_ = true;
}
// check against stop tokens ids
if (stopping_criteria_->stop_token_ids.count(last_token_id) > 0) {
finish_reason_ = FinishReason::STOP;
return is_finished_ = true;
}
// TODO: Add other stopping criterias

return false;
}
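Putting the pieces together, here is a self-contained sketch of the new stop condition: generation ends either on the single `eos_token_id` or on any id in `stop_token_ids` (for Yi: 2, 6, 7, 8). The struct is a stand-in for the real `StoppingCriteria`, not the actual type:

```cpp
// Sketch of the stop-token check added to Sequence::check_stopping_creteria.
#include <cstdint>
#include <unordered_set>

struct StoppingCriteriaSketch {
  int32_t eos_token_id = 2;
  bool ignore_eos_token = false;
  std::unordered_set<int32_t> stop_token_ids = {2, 6, 7, 8};
};

bool should_stop(int32_t last_token_id, const StoppingCriteriaSketch& c) {
  if (!c.ignore_eos_token && last_token_id == c.eos_token_id) {
    return true;  // plain EOS
  }
  return c.stop_token_ids.count(last_token_id) > 0;  // e.g. <|im_end|>
}

int main() {
  StoppingCriteriaSketch criteria;
  // 7 is <|im_end|> in the Yi tokenizer, so generation should stop here.
  return should_stop(/*last_token_id=*/7, criteria) ? 0 : 1;
}
```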
10 changes: 7 additions & 3 deletions src/request/stopping_criteria.h
@@ -1,12 +1,14 @@
#pragma once

#include <cstdint>
#include <vector>
#include <string>
#include <unordered_set>
#include <vector>

namespace llm {

// StoppingCriteria is used to specify stopping criterias for a request/sequence.
// StoppingCriteria is used to specify stopping criterias for a
// request/sequence.
struct StoppingCriteria {
// maximum number of generated tokens
size_t max_tokens = 0;
@@ -17,9 +19,11 @@ struct StoppingCriteria {
// whether to ignore eos token when checking stopping criterias
bool ignore_eos_token = false;

// stop token ids
std::unordered_set<int32_t> stop_token_ids;

// stop sequences
// std::vector<std::string> stop_sequences;

};

} // namespace llm
2 changes: 1 addition & 1 deletion src/sampling/logits_processor_test.cpp
@@ -15,7 +15,7 @@ TEST(LogitsProcessorTest, Temperature) {
TemperatureLogitsProcessor processor(temperatures, dtype, device);

int64_t batch_size = 2;
int64_t vocab_size = 5;
int64_t vocab_size = 32000;
auto logits = torch::randn({batch_size, vocab_size},
torch::dtype(dtype).device(device));
auto token_ids = torch::randint(/*high=*/vocab_size,
1 change: 1 addition & 0 deletions src/server/handlers/chat_handler.cpp
@@ -161,6 +161,7 @@ std::unique_ptr<Request> grpc_request_to_request(ChatCallData* call_data,
stopping_criteria.max_tokens = max_tokens;
// stopping_criteria.ignore_eos_token = false;
stopping_criteria.eos_token_id = model_args.eos_token_id();
stopping_criteria.stop_token_ids = model_args.stop_token_ids();

if (grpc_request.has_stream()) {
request->stream = grpc_request.stream();
