Merge V3.6 integration with main (#41)
* support multi-hot cat input

* Update CI.DockerFile to latest v3.5-integration

* support standalone hps lib

* change to new library name

* Build the backbone of the hps backend

* finish backbone building

* make hugectr backend compatible with new HPS

* Modify hugectr inference backend to adapt to new HPS

* update wdl model tutorial

* bug fix

* reset other notebooks

* model replacement

* delete checkpoint files

* Modify hps backend to adapt to new HPS

* support single embedding table lookup for hps backend

* add triton helpers file

* support new hps ctor API

* support multi-table embedding key queries

* add a brief introduction about HPS and use cases

* Update CI.DockerFile with hugectr master branch

* Fix typos

* fix docker run command

* Add detailed introduction about hps interface configuration, etc.

* resize image

* Upload configuration management poc demo

* add UI figure

* Config sys demo comments

* fix issue where disabling the embedding cache (EC) causes a crash

* add test case for switching off the embedding cache

* Fix links in README

* delete cm related

* update hps docs

* merge history

* update docker version

Co-authored-by: kingsleyl <kingsleyl@nvidia.com>
Co-authored-by: zhuwenjing <zhuwenjing@360.cn>
Co-authored-by: Joey Wang <zehuanw@nvidia.com>
Co-authored-by: Jerry Shi <jershi@nvidia.com>
5 people authored Apr 29, 2022
1 parent 9975c81 commit 1934f11
Showing 11 changed files with 60 additions and 60 deletions.
5 changes: 2 additions & 3 deletions README.md
@@ -68,7 +68,7 @@ git clone https://github.com/NVIDIA/HugeCTR.git
cd HugeCTR
git submodule update --init --recursive
```
For more information, see [Building HugeCTR from Scratch](https://github.com/NVIDIA/HugeCTR/blob/master/docs/hugectr_user_guide.md#building-hugectr-from-scratch).
For more information, see [Building HugeCTR from Scratch](https://nvidia-merlin.github.io/HugeCTR/master/hugectr_user_guide.html#building-hugectr-from-scratch).

After you've built HugeCTR from scratch, do the following:
1. Download the HugeCTR Backend repository by running the following commands:
@@ -99,7 +99,7 @@ Since the HugeCTR Backend is a customizable Triton component, it is capable of s
From V3.3.1, the HugeCTR Backend is fully compatible with the [Model Control EXPLICIT Mode](https://github.com/triton-inference-server/server/blob/main/docs/model_management.md#model-control-mode-explicit) of Triton. After adding the configuration of a new model to the HPS configuration file, the HugeCTR Backend supports deploying the new model online through [the load API](https://github.com/triton-inference-server/server/blob/master/docs/protocol/extension_model_repository.md#load) of Triton. Old models can also be recycled online through [the unload API](https://github.com/triton-inference-server/server/blob/master/docs/protocol/extension_model_repository.md#unload).

The following should be noted when using Model Repository Extension functions:
- Deploy new models online: [The load API](https://github.com/triton-inference-server/server/blob/master/docs/protocol/extension_model_repository.md#load) will not only load the network dense weights as part of the HugeCTR model, but also insert the embedding tables of the new models into the Hierarchical Inference Parameter Server and create the embedding caches based on the model definition in [Independent Parameter Server Configuration](./Independent_Parameter_Server_Configuration), which means the Parameter Server will independently provide an initialization mechanism for the new embedding tables and embedding caches of new models.
- Deploy new models online: [The load API](https://github.com/triton-inference-server/server/blob/master/docs/protocol/extension_model_repository.md#load) will not only load the network dense weights as part of the HugeCTR model, but also insert the embedding tables of the new models into the Hierarchical Inference Parameter Server and create the embedding caches based on the model definition in [Independent Parameter Server Configuration](https://gitlab-master.nvidia.com/dl/hugectr/hugectr_inference_backend/-/tree/main/hps_backend#independent-inference-hierarchical-parameter-server-configuration), which means the Parameter Server will independently provide an initialization mechanism for the new embedding tables and embedding caches of new models.

- Update a deployed model online: [The load API](https://github.com/triton-inference-server/server/blob/master/docs/protocol/extension_model_repository.md#load) will load the network dense weights as part of the HugeCTR model, update the embedding tables of the latest model file in the Inference Hierarchical Parameter Server, and refresh the embedding cache, which means the Parameter Server will independently provide an update mechanism for existing embedding tables. A minimal example of calling these APIs is sketched below.
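
As a rough illustration of the Model Repository Extension calls referenced above, the snippet below drives the load and unload endpoints from Python with `tritonclient`. The model name `wdl` and the server address are assumptions for illustration only, not values taken from this commit.

```
# Hedged sketch: deploy, check, and recycle a model online through Triton's
# model repository extension. Assumes a Triton server reachable at
# localhost:8000 and a model named "wdl" in the model repository.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
client.load_model("wdl")             # deploy a new model or update a deployed one
print(client.is_model_ready("wdl"))  # verify the model is ready to serve lookups
client.unload_model("wdl")           # recycle the old model
```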

@@ -151,7 +151,6 @@ The configuration file of inference Parameter Server should be formatted using t
```

## HugeCTR Inference Hierarchical Parameter Server

The HugeCTR Inference Hierarchical Parameter Server implements a hierarchical storage mechanism between local SSDs and CPU memory, which breaks the convention that the embedding table must be stored in local CPU memory. The `volatile_db` (volatile database) layer allows utilizing Redis cluster deployments to store and retrieve embeddings in/from the RAM available in your cluster. The `Persistent Database` layer links HugeCTR with a persistent database: each node that has such a persistent storage layer configured retains a separate copy of all embeddings in its locally available non-volatile memory. See [Distributed Deployment](docs/architecture.md#distributed-deployment-with-hierarchical-hugectr-parameter-server) and [HugeCTR Inference Hierarchical Parameter Server](hps_backend/docs/hierarchical_parameter_server.md) for more details; a hedged configuration sketch follows.
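
For orientation only, here is a minimal sketch of how the volatile and persistent layers might be declared when generating the Parameter Server JSON file. The layer names follow the text above, but the address, path, and exact schema are assumptions; consult the HPS configuration documentation for the authoritative format.

```
# Hedged sketch: emit a ps.json fragment with a RAM-backed volatile layer and
# an SSD-backed persistent layer. All concrete values are illustrative assumptions.
import json

ps_config = {
    "volatile_db": {                  # RAM layer, e.g. a Redis cluster
        "type": "redis_cluster",
        "address": "127.0.0.1:7000",  # assumed example address
    },
    "persistent_db": {                # non-volatile layer on local SSD
        "type": "rocksdb",
        "path": "/wdl_infer/rocksdb", # assumed example path
    },
}

with open("ps.json", "w") as f:       # model entries would be added alongside these layers
    json.dump(ps_config, f, indent=2)
```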

In the following table, we provide an overview of the typical properties of the different parameter database layers (and the embedding cache). We emphasize that this table is only intended to provide a rough orientation; properties of actual deployments may deviate.
7 changes: 4 additions & 3 deletions hps_backend/README.md
@@ -30,7 +30,7 @@

# Hierarchical Parameter Server Backend

The Hierarchical Parameter Server (HPS) Backend is a framework for looking up embedding vectors in large-scale embedding tables. It is designed to use GPU memory effectively to accelerate lookups by decoupling the embedding tables and embedding cache from the end-to-end inference pipeline of the deep recommendation model. The HPS Backend supports executing multiple embedding-vector lookup services concurrently across multiple GPUs through an embedding cache that is shared between multiple lookup sessions. For more information, see [Hierarchical Parameter Server Architecture](docs/architecture.md#hugectr-inference-framework).
The Hierarchical Parameter Server (HPS) Backend is a framework for looking up embedding vectors in large-scale embedding tables. It is designed to use GPU memory effectively to accelerate lookups by decoupling the embedding tables and embedding cache from the end-to-end inference pipeline of the deep recommendation model. The HPS Backend supports executing multiple embedding-vector lookup services concurrently across multiple GPUs through an embedding cache that is shared between multiple look_up sessions. For more information, see [Hierarchical Parameter Server Architecture](docs/architecture.md#hugectr-inference-framework).

## Quick Start
You can build the HPS Backend from scratch and install it to a specified path based on your own requirements using the NGC Merlin inference Docker images.
@@ -57,7 +57,7 @@ All NVIDIA Merlin components are available as open-source projects. However, a m

Docker images for the HugeCTR Backend are available in the NVIDIA container repository on https://ngc.nvidia.com/catalog/containers/nvidia:merlin:merlin-inference. You can pull and launch the container by running the following command:
```
docker run --gpus=1 --rm -it nvcr.io/nvidia/merlin/merlin-inference:22.04 # Start in interactive mode
docker run --gpus=1 --rm -it nvcr.io/nvidia/merlin/merlin-inference:22.05 # Start in interactive mode
```

**NOTE**: As of HugeCTR version 3.0, the HugeCTR container is no longer being released separately. If you're an advanced user, you should use the unified Merlin container to build the HugeCTR Training or Inference Docker image from scratch based on your own specific requirements. You can obtain the unified Merlin container by logging into NGC or by going [here](https://github.com/NVIDIA-Merlin/Merlin/blob/main/docker/inference/dockerfile.ctr).
@@ -69,7 +69,8 @@ git clone https://github.com/NVIDIA/HugeCTR.git
cd HugeCTR
git submodule update --init --recursive
```
For more information, see [Building HugeCTR from Scratch](https://github.com/NVIDIA/HugeCTR/blob/master/docs/hugectr_user_guide.md#building-hugectr-from-scratch).

For more information, see [Building HugeCTR from Scratch](https://nvidia-merlin.github.io/HugeCTR/master/hugectr_user_guide.html#building-hugectr-from-scratch).

After you've built HugeCTR from scratch, do the following:
1. Download the HugeCTR Backend repository by running the following commands:
14 changes: 9 additions & 5 deletions hps_backend/docs/architecture.md
@@ -10,14 +10,14 @@ The HPS Backend also offers the following:

* **Concurrent Model Execution**: Multiple lookup sessions of the same embedding tables can run simultaneously on the same GPU or multiple GPUs.
* **Extensible Backends**: The Inference interface provided by HugeCTR can be easily integrated with the backend API, which allows models to be extended with any execution logic using Python or C++.
* **Easy Deployment of New Embedding Tables**: Updating an embedding table should be as transparent as possible and should not affect the lookup performance. This means that no matter how many embedding tables of different models need to be deployed, as long as the training embedding table conforms to the key-value-pair input format, it can be deployed in the HPS backend to provide efficient embedding vector lookup services and can be loaded through the same HPS backend API. **Note**: It might be necessary to update configuration files for each model in some cases. For the specific input format of the embedding table, please refer to [here](#variant-compressed-sparse-row-input)
* **Easy Deployment of New Embedding Tables**: Updating an embedding table should be as transparent as possible and should not affect the lookup performance. This means that no matter how many embedding tables of different models need to be deployed, as long as the training embedding table conforms to the key-value-pair input format, it can be deployed in the HPS backend to provide efficient embedding vector lookup services and can be loaded through the same HPS backend API. **Note**: It might be necessary to update configuration files for each model in some cases. For the specific input format of the embedding table, please refer to [here](#hierarchical-parameter-server-input-format)

### HPS Backend Framework
The following components make up the HugeCTR Backend framework:

* The **Hierarchical Database Backend** is responsible for loading and managing large embedding tables that belong to different models. The embedding tables provide synchronization and update services for the embedding cache. It also ensures that the embedding table is completely loaded and updated regularly.
* The **Embedding Cache** can be loaded directly into the GPU memory. Thereby, it provides embedding vector lookup functionalities for the model and avoids the comparatively high latencies incurred when transferring data from the Parameter Server (CPU and GPU transfer). It also provides an update mechanism for loading the latest cached embedding vectors in time to ensure a high hit rate.
* The **Lookup Session/Instance** can interact directly with the embedding cache in the GPU memory to obtain embedding vectors. Based on the hierarchical design structure, multiple lookup instances share the embedding cache in the GPU memory to enable concurrent lookup execution. Based on the dependencies of the hierarchical storage, GPU caching completely decouples embedding vector lookup services from the host, which implements an efficient and low-latency lookup operation by relying on the GPU embedding cache. It is also possible to implement inference logic using hierarchical initialization and dependency injection.
* The **Lookup Session/Instance** can interact directly with the embedding cache in the GPU memory to obtain embedding vectors. Based on the hierarchical design structure, multiple lookup instances share the embedding cache in the GPU memory to enable concurrent lookup execution. Based on the dependencies of the hierarchical storage, GPU caching completely decouples embedding vector lookup services from the host, which implements an efficient and low-latency lookup operation relying on the GPU embedding cache. Meanwhile, the Lookup Session/Instance is thread-safe and can be launched concurrently with multiple threads to improve the QPS of the inference service.

Here's an in-depth look into the design framework of the HugeCTR Inference interface:

@@ -66,7 +66,7 @@ The following parameters have to be set in the ps.json file for the HugeCTR Back
* If the real hit rate of looking up embedding keys in the GPU embedding cache is lower than the user-defined threshold, the GPU embedding cache will insert the missing embedding vectors synchronously.
* If the real hit rate of the GPU embedding cache lookup is greater than the user-defined threshold, the GPU embedding cache will insert the missing keys asynchronously.

The hit rate threshold must be set in the Parameter Server JSON file. For example, see [HugeCTR Backend configuration]( ../samples/README.md#HugeCTR_Backend_configuration).
The hit rate threshold must be set in the Parameter Server JSON file. For more details, see [HugeCTR HPS Configuration Book]( https://github.com/NVIDIA-Merlin/HugeCTR/blob/master/docs/source/hugectr_parameter_server.md#configuration-book).


When the GPU embedding cache mechanism is disabled (i.e., `"gpucache"` is set to `false`), the model will directly look up the embedding vector from the Parameter Server. In this case, all remaining settings pertaining to the GPU embedding cache will be ignored.
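
To make the cache-related switches concrete, here is a minimal sketch of the embedding-cache portion of one model entry in the Parameter Server JSON file. Only `gpucache` is taken from the text above; the other key names and numbers are illustrative assumptions rather than settings from this commit.

```
# Hedged sketch: embedding-cache settings for one model entry in ps.json.
# Key names besides "gpucache" and all values are assumed examples.
model_entry = {
    "model": "wdl",             # assumed model name
    "gpucache": True,           # enable the GPU embedding cache
    "hit_rate_threshold": 0.9,  # below this measured hit rate, missing vectors are inserted synchronously
}
```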
@@ -95,8 +95,12 @@ When the GPU embedding cache mechanism is disabled (i.e., `"gpucache"` is set to
* **deployed_device_list**: Indicates the list of devices used to deploy HPS.
* **max_batch_size**: This item specifies the maximum number of samples per lookup request.
* **embedding_vecsize_per_table**: This item determines the pre-allocated memory size on the host and device. In the case of multiple embedding tables, the embedding vector size may differ per table, so this configuration item requires the user to fill in the maximum vector size for each embedding table.
* **maxnum_catfeature_query_per_table_per_sample**: This item determines the pre-allocated memory size on the host and device. We assume that for each input sample there is a maximum number of embedding keys per embedding table that need to be looked up, so the user needs to configure the maximum number of queried embedding keys per embedding table in this item.
* **embedding_table_names**: This configuration item needs to be filled with the name of each embedding table, which will be used to name the data partition and data table in the hierarchical database backend.
* **maxnum_catfeature_query_per_table_per_sample**: List[Int], this item determines the pre-allocated memory size on the host and device. We assume that for each input sample there is a maximum number of embedding keys per embedding table that need to be looked up, so the user needs to configure [Maximum(the number of embedding keys that need to be queried from embedding table 1 in each sample), Maximum(the number of embedding keys that need to be queried from embedding table 2 in each sample), ...] in this item. This is a mandatory configuration item.

* **maxnum_des_feature_per_sample**: Int, each sample may contain a varying number of numeric (dense) features, so the user needs to configure the value of Maximum(the number of dense features in each sample) in this item, which determines the pre-allocated memory size on the host and device. The default value is `26`.

* **embedding_table_names**: List[String], this configuration item needs to be filled with the name of each embedding table, which will be used to name the data partition and data table in the hierarchical database backend. The default value is `["sparse_embedding1", "sparse_embedding2", ...]`. A hedged example of these items is sketched below.
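
As announced above, the sketch below shows how the lookup-related items of one model entry might be filled in for a model with two embedding tables. All concrete numbers and names are illustrative assumptions, not values taken from this commit.

```
# Hedged sketch: per-model lookup parameters in ps.json for a model with two
# embedding tables. Every value below is an assumed example.
model_entry = {
    "model": "wdl",                                           # assumed model name
    "deployed_device_list": [0],                              # deploy HPS on GPU 0
    "max_batch_size": 64,                                     # samples per lookup request
    "embedding_vecsize_per_table": [1, 16],                   # maximum vector size of each table
    "maxnum_catfeature_query_per_table_per_sample": [2, 26],  # max keys queried per table per sample
    "maxnum_des_feature_per_sample": 13,                      # max dense features per sample
    "embedding_table_names": ["sparse_embedding1", "sparse_embedding2"],
}
```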


## Distributed Deployment with Hierarchical HugeCTR Parameter Server ##
The hierarchical HugeCTR parameter server (PS) allows deploying models that exceed the existing GPU memory space, while retaining a relatively low latency. To provide this functionality, our PS exhibits a hierarchical structure that can make use of the various memory resources of each cluster node. In other words, the hierarchical parameter server utilizes Random Access Memory (RAM) and non-volatile memory resources in your cluster to extend the embedding cache and allow faster response times for very large datasets.
8 changes: 4 additions & 4 deletions samples/README.md
@@ -45,12 +45,12 @@ You can pull the `Merlin-Training` container by running the following command:
DLRM model training:

```
docker run --gpus=all -it --cap-add SYS_NICE -v ${PWD}:/dlrm_train/ --net=host nvcr.io/nvidia/merlin/merlin-training:22.02 /bin/bash
docker run --gpus=all -it --cap-add SYS_NICE -v ${PWD}:/dlrm_train/ --net=host nvcr.io/nvidia/merlin/merlin-training:22.05 /bin/bash
```

Wide&Deep model training:
```
docker run --gpus=all -it --cap-add SYS_NICE -v ${PWD}:/wdl_train/ --net=host nvcr.io/nvidia/merlin/merlin-training:22.02 /bin/bash
docker run --gpus=all -it --cap-add SYS_NICE -v ${PWD}:/wdl_train/ --net=host nvcr.io/nvidia/merlin/merlin-training:22.05 /bin/bash
```

The container will open a shell when the run command execution is completed. You'll have to start JupyterLab in the Docker container. It should look similar to this:
@@ -122,12 +122,12 @@ mkdir -p wdl_infer

DLRM model inference container:
```
docker run -it --gpus=all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --net=host -v dlrm_infer:/dlrm_infer/ -v dlrm_train:/dlrm_train/ nvcr.io/nvidia/merlin/merlin-inference:21.09
docker run -it --gpus=all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --net=host -v dlrm_infer:/dlrm_infer/ -v dlrm_train:/dlrm_train/ nvcr.io/nvidia/merlin/merlin-inference:22.05
```

Wide&Deep model inference container:
```
docker run -it --gpus=all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --net=host -v wdl_infer:/wdl_infer/ -v wdl_train:/wdl_train/ nvcr.io/nvidia/merlin/merlin-inference:21.09
docker run -it --gpus=all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --net=host -v wdl_infer:/wdl_infer/ -v wdl_train:/wdl_train/ nvcr.io/nvidia/merlin/merlin-inference:22.05
```
The container will open a shell when the run command execution is completed. It should look similar to this:
```
20 changes: 10 additions & 10 deletions samples/dlrm/HugeCTR_DLRM_Inference.ipynb
@@ -310,7 +310,7 @@
"\n",
"In this tutorial, we will deploy the DLRM to a single V100(32GB)\n",
"\n",
"docker run --gpus=all -it -v /dlrm_infer/:/dlrm_infer -v /dlrm_train/:/dlrm_train --net=host nvcr.io/nvidia/merlin/merlin-inference:22.02 /bin/bash\n",
"docker run --gpus=all -it -v /dlrm_infer/:/dlrm_infer -v /dlrm_train/:/dlrm_train --net=host nvcr.io/nvidia/merlin/merlin-inference:22.05 /bin/bash\n",
"\n",
"After you enter into the container you can launch triton server with the command below:\n",
"\n",
@@ -743,24 +743,24 @@
"* Trying 127.0.0.1:8000...\r\n",
"* TCP_NODELAY set\r\n",
"* Connected to localhost (127.0.0.1) port 8000 (#0)\r\n",
"> GET /v2/health/ready HTTP/1.1\r",
"> GET /v2/health/ready HTTP/1.1\r\n",
"\r\n",
"> Host: localhost:8000\r",
"> Host: localhost:8000\r\n",
"\r\n",
"> User-Agent: curl/7.68.0\r",
"> User-Agent: curl/7.68.0\r\n",
"\r\n",
"> Accept: */*\r",
"> Accept: */*\r\n",
"\r\n",
"> \r",
"> \r\n",
"\r\n",
"* Mark bundle as not supporting multiuse\r\n",
"< HTTP/1.1 200 OK\r",
"< HTTP/1.1 200 OK\r\n",
"\r\n",
"< Content-Length: 0\r",
"< Content-Length: 0\r\n",
"\r\n",
"< Content-Type: text/plain\r",
"< Content-Type: text/plain\r\n",
"\r\n",
"< \r",
"< \r\n",
"\r\n",
"* Connection #0 to host localhost left intact\r\n"
]
2 changes: 1 addition & 1 deletion samples/hierarchical_deployment/README.md
@@ -144,7 +144,7 @@ mkdir -p wdl_infer

Wide&Deep model inference container:
```
docker run -it --gpus=all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --net=host -v wdl_infer:/wdl_infer/ -v wdl_train:/wdl_train/ -v your_rocksdb_path:/wdl_infer/rocksdb/ nvcr.io/nvidia/merlin/merlin-inference:22.02
docker run -it --gpus=all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --net=host -v wdl_infer:/wdl_infer/ -v wdl_train:/wdl_train/ -v your_rocksdb_path:/wdl_infer/rocksdb/ nvcr.io/nvidia/merlin/merlin-inference:22.05
```
The container will open a shell when the run command execution is completed. It should look similar to this:
```