
Releases: triton-inference-server/hugectr_backend

Merlin: HugeCTR&HPS Backend V23.08

26 Sep 09:55

What's New in Version 23.08

  • Hierarchical Parameter Server:
    • Support static EC fp8 quantization
      We now support fp8 quantization in the static embedding cache. When the fp8_quant configuration is enabled, HPS performs fp8 quantization on each embedding vector as it reads the embedding table, and performs fp32 dequantization on the embedding vector that corresponds to a queried embedding key in the static embedding cache, preserving the accuracy of the dense-part prediction.
    • Large model deployment demo based on HPS TensorRT-plugin
      This demo shows how to use the HPS TRT-plugin to build a complete TRT engine for deploying a 147GB embedding table based on a 1TB Criteo dataset. We also provide a static embedding implementation that fully offloads embedding tables to host page-locked memory for benchmarks on x86 and Grace Hopper Superchip.
    • Issues Fixed
      • Resolved a Kafka update ingestion error that prevented handing over online parameter updates coming from Kafka message queues to Redis database backends.
      • Fixed an issue where the HPS Triton backend re-initialized the embedding cache because of an undefined null value when getting the embedding cache on the corresponding device.
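The fp8 quantize-on-read / dequantize-on-lookup flow described above can be sketched in plain Python. This is only an illustrative stand-in (symmetric per-vector scaling into an 8-bit integer range rather than real fp8 e4m3), and all names are hypothetical, not the actual HPS implementation:

```python
# Illustrative sketch of quantize-on-read / dequantize-on-lookup.
# Symmetric per-vector scaling into an 8-bit range stands in for real
# fp8 (e4m3) quantization; function and variable names are hypothetical.

def quantize_vector(vec, qmax=127):
    """Quantize a float vector to 8-bit integers with one scale per vector."""
    scale = max(abs(v) for v in vec) / qmax or 1.0
    return [round(v / scale) for v in vec], scale

def dequantize_vector(q, scale):
    """Recover approximate fp32 values for the dense part of the model."""
    return [v * scale for v in q]

# "Read the embedding table": quantize each vector once at load time.
table = {42: [0.5, -1.0, 0.25, 0.125]}
cache = {k: quantize_vector(v) for k, v in table.items()}

# "Lookup": dequantize the cached vector for the queried embedding key.
q, s = cache[42]
approx = dequantize_vector(q, s)
```

The dense part of the model then consumes `approx`, which stays close to the original fp32 values while the cache stores one byte per element plus one scale per vector.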

Merlin: HPS Backend V23.09

26 Sep 09:56
Pre-release
Merge branch 'release-23.09' into 'main'

Deprecate the HugeCTR Triton Backend

See merge request dl/hugectr/hugectr_inference_backend!106

Merlin: HugeCTR Backend V4.3 (22.12)

31 Dec 13:16
7a04aa5

Release Notes

What’s New in Version 4.3

  • HPS Improvements:
    • RedisClusterBackend now supports TLS/SSL authentication. The hps_demo.ipynb notebook has been extended with step-by-step instructions that show you how to set up HPS to use Redis with (and without) encryption.
    • MultiProcessHashMapBackend: Fixed a bug that prevented configuring the shared memory size when using JSON-file-based configuration.
    • On-device input keys are now supported, which removes an extra host-to-device copy.
    • Removed the dependency on the xxHash library, which HugeCTR no longer uses.
    • Added static table support to the embedding cache. The static table is suitable when the embedding table can be placed entirely in GPU memory; in that case, it is more than three times faster than the embedding cache lookup. The static table does not support embedding updates.
    • Added a sample of using the HPS backend with PyTorch for inference via the Triton ensemble mode.

Merlin: HugeCTR Backend V4.2

16 Nov 14:28
8fe3b4a

Release Notes

What’s New in Version 4.2

  • Change to HPS with Redis or Kafka:
    This release includes a change to Hierarchical Parameter Server and affects deployments that use Redis or model parameter streaming with Kafka.
    A third-party library that was used for the HPS partition selection algorithm has been replaced to improve performance.
    The new algorithm can produce different partition assignments for volatile databases.
    As a result, volatile database backends that retain data across application restarts, such as Redis, must be reinitialized.
    Model streaming with Kafka is equally affected.
    To avoid issues with updates, reset all respective queue offsets to the end_offset before you reinitialize the Redis database.

  • New Volatile Database Type for HPS:
    This release adds a db_type value of multi_process_hash_map to the Hierarchical Parameter Server.
    This database type supports sharing embeddings across process boundaries by using shared memory and the /dev/shm device file.
    Multiple processes running HPS can read and write to the same hash map.
    For an example, refer to the Hierarchical Parameter Server Demo notebook.
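The sharing mechanism behind multi_process_hash_map can be sketched with Python's standard shared-memory module, which is likewise backed by /dev/shm on Linux. This illustrates only the cross-process sharing idea, not the actual HPS hash map; the record layout below is hypothetical:

```python
# Minimal sketch of the multi_process_hash_map idea: two handles onto one
# shared-memory segment (backed by /dev/shm on Linux). Illustrative only;
# the (key, value) layout is hypothetical, not the HPS on-disk format.
from multiprocessing import shared_memory
import struct

# Writer process: store a (key, embedding-float) pair at a fixed offset.
shm = shared_memory.SharedMemory(create=True, size=64)
struct.pack_into("qf", shm.buf, 0, 42, 0.125)

# Reader: a second handle (e.g. another HPS process) attaches by name.
reader = shared_memory.SharedMemory(name=shm.name)
key, value = struct.unpack_from("qf", reader.buf, 0)

reader.close()
shm.close()
shm.unlink()
```

Both handles see the same bytes, which is what lets multiple HPS processes read and write one hash map.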

  • Enhancements to the HPS Redis Backend:
    In this release, the Hierarchical Parameter Server can open multiple connections in parallel to each Redis node.
    This enhancement enables HPS to take advantage of overlapped processing optimizations in the I/O module of Redis servers.
    In addition, HPS can now take advantage of Redis hash tags to co-locate embedding values and metadata.
    This enhancement can reduce the number of accesses to Redis nodes and the number of per-node round trip communications that are needed to complete transactions.
    As a result, the enhancement increases the insertion performance.
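The co-location mechanism above relies on a documented Redis Cluster rule: when a key contains a hash tag, only the substring between the first "{" and the next "}" is hashed, so keys sharing a tag land in the same slot and on the same node. The sketch below implements that tag-extraction rule; Python's hash() stands in for Redis's CRC16-mod-16384, and the key names are hypothetical, not the actual HPS key schema:

```python
# Sketch of Redis Cluster hash tags: only the substring between the first
# "{" and the next "}" is hashed, so embedding values and their metadata
# that share a tag map to the same slot. hash() stands in for CRC16.

def hashtag(key: str) -> str:
    """Return the hash-tag substring if present, else the whole key."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:          # Redis ignores empty tags "{}"
            return key[start + 1:end]
    return key

def slot(key: str, num_slots: int = 16384) -> int:
    return hash(hashtag(key)) % num_slots  # stand-in for crc16(tag) % 16384

# Value and metadata keys share a tag, so one node can serve both in a
# single round trip.
same_node = slot("{model.table0}:emb:42") == slot("{model.table0}:meta:42")
```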

  • Issues Fixed:

    • An error in the HPS lookup_fromdlpack() method is fixed.
      The error was related to calculating the number of keys and vectors from the corresponding DLPack tensors.
    • An error in the HugeCTR backend for Triton Inference Server is fixed.
      A crash was triggered when the initial size of the embedding cache was smaller than the allowed minimum size.

Merlin: HugeCTR Backend V4.1

25 Oct 13:36
6157dec

Release Notes

What’s New in Version 4.1

  • On-Device Input Keys for HPS Lookup:
    The HPS lookup supports input embedding keys that are on GPU memory during inference.
    This enhancement removes a host-to-device copy by using the DLPack lookup_fromdlpack() interface.
    By using the interface, the input DLPack capsule of the embedding key can be a GPU tensor.

  • Issues Fixed:

    • Fixed an issue where the HPS backend returned unexpected results.
      The problem was caused by overlapping multiple embedding table outputs.

Merlin: HugeCTR Backend V4.0

22 Sep 08:52

Release Notes

What’s New in Version 4.0

  • Embedding Cache Initialization with Configurable Ratio:
    In previous releases, the default value for the cache_refresh_percentage_per_iteration parameter of the InferenceParams was 0.1.

    In this release, the default value is 0.0, and the parameter serves an additional purpose.
    If you set the parameter to a value greater than 0.0 and also set use_gpu_embedding_cache to True for a model, then when the Hierarchical Parameter Server (HPS) starts, it initializes the embedding cache for the model on the GPU by loading a subset of the embedding vectors from the model's sparse files.
    When embedding cache initialization is used, HPS creates log records at the INFO level during startup.
    The records are similar to EC initialization for model: "<model-name>", num_tables: <int> and EC initialization on device: <int>.
    This enhancement reduces the duration of the warm up phase.
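A plausible configuration fragment for this feature is sketched below. Only use_gpu_embedding_cache and cache_refresh_percentage_per_iteration come from the release notes above; the surrounding entry structure and the model name are illustrative, so check the HPS configuration documentation for the exact ps.json schema:

```python
# Hedged sketch of an HPS model entry enabling embedding cache
# initialization at startup. The structure around the two documented
# parameters is illustrative, not the verified ps.json schema.
import json

model_config = {
    "model": "my_model",                  # hypothetical model name
    "use_gpu_embedding_cache": True,
    # > 0.0 means HPS preloads this fraction of embedding vectors from
    # the sparse files into the GPU cache at startup (default is 0.0).
    "cache_refresh_percentage_per_iteration": 0.2,
}

# Round-trip through JSON, as the backend would parse it from ps.json.
parsed = json.loads(json.dumps(model_config))
```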

  • Issues Fixed:

    • Fixed an issue where the max_batch_size configuration of the HPS backend was not applied.
      The problem was caused by incomplete parsing of the ps.json configuration file.

Merlin: HugeCTR Backend V3.9

17 Aug 02:17
af94d79

Release Notes

What’s New in Version 3.9

  • Enhancements to the HPS Backend for Triton Inference Server
    This release adds support for integrating the HPS Backend and the TensorFlow Backend through the ensemble mode with Triton Inference Server. The enhancement enables deploying a TensorFlow model with large embedding tables on Triton by leveraging HPS.
    For more information, refer to the sample programs in the hps-triton-ensemble directory of the HugeCTR Backend repository in GitHub.

  • API Enhancements for the HPS Database Backend:

    • The HPS DatabaseBackend APIs have been extended. DatabaseBackends now allow supplying a maximum time budget to queries so that you can build applications that must operate within strict latency limits. Fetch queries return execution control to the caller once the budget is exhausted. Unprocessed entries are indicated to the caller through a callback function.
    • DatabaseBackends now provide two new APIs, dump and load_dump, which let you dump embedding tables to disk and load them back. We support a custom binary format and the RocksDB SST table file format. This also means that you can now import embedding table data directly from your custom tools into HugeCTR and vice versa.
    • The new find_tables API allows you to discover all table data currently stored for a model in a DatabaseBackend. A new overload of the evict API can process the results of find_tables to quickly drop all stored information related to a model.
  • Documentation Enhancements:
    The unified documentation details for HugeCTR Hierarchical Parameter Server are updated for consistency and clarity.
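The time-budgeted fetch behavior described in the API enhancements above can be sketched in Python: look up keys until the budget is spent, then return control to the caller and report the unprocessed keys through a callback. All names below are hypothetical stand-ins, not the real HugeCTR C++ signatures:

```python
# Illustrative sketch of a time-budgeted fetch: stop once the budget is
# exhausted and indicate unprocessed keys via a callback. Hypothetical
# names, not the actual DatabaseBackend API.
import time

def fetch(store, keys, budget_s, on_unprocessed):
    """Look up keys until budget_s seconds elapse; report the rest."""
    deadline = time.monotonic() + budget_s
    results = {}
    for i, key in enumerate(keys):
        if time.monotonic() >= deadline:
            on_unprocessed(keys[i:])  # hand leftovers back to the caller
            break
        results[key] = store.get(key)
    return results

store = {k: k * 0.5 for k in range(5)}
missed = []
out = fetch(store, list(range(5)), budget_s=1.0, on_unprocessed=missed.extend)
```

A latency-bound application can then decide whether to retry the missed keys, fall back to default embeddings, or skip them for this request.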

Merlin: HugeCTR Backend V3.7

14 Jun 09:28
0ad141e

Release Notes

What’s New in Version 3.7

  • HPS Performance Improvements:

    • Kafka: Model parameters are now stored in Kafka in a bandwidth-saving multiplexed data format.
      This data format vastly increases throughput. In our lab, we measured transfer speeds up to 1.1 Gbps for each Kafka broker.
    • HashMap backend: Parallel and single-threaded hashmap implementations have been replaced by a new unified implementation.
      This new implementation uses a new memory-pool based allocation method that vastly increases upsert performance without diminishing recall performance.
      Compared with the previous implementation, you can expect a 4x speed improvement for large-batch insertion operations.
    • Suppressed logs: Most log messages related to HPS have been changed to the TRACE level rather than INFO or DEBUG to reduce logging verbosity. Users can configure multi-level log output when the Triton service is launched, thereby improving the throughput of online inference.
    • Simplified configuration: The HugeCTR backend has completely decoupled the inference Parameter Server configuration (ps.json) from the Triton configuration (config.pbtxt), avoiding repeated configuration in Triton.
    • Freeze embedding update: The HugeCTR backend now supports updating only the dense part of the model through Triton's model control interface, which prevents the embedding from being updated repeatedly online. See Model Repository Extension.
  • Documentation Enhancements:
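The memory-pool allocation idea behind the unified hash map implementation described above can be sketched as a free list of pre-allocated value slots, so an upsert recycles a slot instead of allocating fresh memory. This is purely conceptual, not HugeCTR code:

```python
# Conceptual sketch of pooled allocation for hash map upserts: value
# slots live in one pre-allocated arena and are recycled via a free
# list, avoiding a heap allocation per insertion. Hypothetical names.

class PooledHashMap:
    def __init__(self, capacity, value_dim):
        # One flat arena of value slots, carved out up front.
        self.arena = [[0.0] * value_dim for _ in range(capacity)]
        self.free = list(range(capacity))  # free list of slot indices
        self.index = {}                    # key -> slot

    def upsert(self, key, values):
        slot = self.index.get(key)
        if slot is None:
            slot = self.free.pop()         # reuse a pooled slot
            self.index[key] = slot
        self.arena[slot][:] = values       # overwrite in place

    def fetch(self, key):
        return self.arena[self.index[key]]

m = PooledHashMap(capacity=4, value_dim=2)
m.upsert(7, [1.0, 2.0])
m.upsert(7, [3.0, 4.0])  # the update reuses the same slot
```

Because large-batch insertion touches only the free list and the arena, this is the kind of structure where upsert throughput improves without hurting recall (lookup) performance.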

Merlin: HugeCTR Backend V3.6

29 Apr 13:37
1934f11

Release Notes

What’s New in Version 3.6

  • Documentation Enhancements:

    • The Configuration section of the Hierarchical Parameter Server information is updated with more information about the parameters in the configuration file.
  • Issues Fixed:

    • Hierarchical Parameter Server (HPS) would produce a runtime error when the GPU cache was turned off.
      This issue is now fixed.

Merlin: HugeCTR Backend V3.5

04 Apr 14:25
9975c81

Release Notes

What’s New in Version 3.5

  • Hierarchical Parameter Server (HPS) Triton Backend:
    The Hierarchical Parameter Server (HPS) Backend is a framework for embedding vector lookup on large-scale embedding tables. It is designed to use GPU memory effectively to accelerate lookups by decoupling the embedding tables and embedding cache from the end-to-end inference pipeline of the deep recommendation model. For more information, refer to Hierarchical Parameter Server.
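The core idea above, a small GPU-resident cache in front of a large embedding table, can be sketched with plain dictionaries standing in for both tiers. This is a conceptual LRU sketch, not the HPS cache implementation:

```python
# Minimal sketch of the HPS idea: a small "GPU" cache in front of a
# large embedding table, so hot keys are served without touching the
# full parameter-server tier. Plain dicts stand in for both tiers.
from collections import OrderedDict

class EmbeddingCache:
    def __init__(self, parameter_server, capacity):
        self.ps = parameter_server     # full embedding table (host/SSD)
        self.cache = OrderedDict()     # stand-in for the GPU cache
        self.capacity = capacity

    def lookup(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)      # LRU hit: no PS access
            return self.cache[key]
        vec = self.ps[key]                   # miss: fetch from PS tier
        self.cache[key] = vec
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict least-recently used
        return vec

ps = {k: [float(k)] * 4 for k in range(100)}
ec = EmbeddingCache(ps, capacity=8)
vec = ec.lookup(42)
```

Decoupling the cache from the inference pipeline this way is what lets the dense model run on the GPU while the embedding tables stay wherever they fit.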

  • HPS interface encapsulation and exporting as library:
    We encapsulate the HPS interfaces and deliver them as a standalone library. In addition, we provide HPS Python APIs and demonstrate their usage with a notebook. For more information, refer to the HPS Demo.

  • HPS performance optimization: A better method is now used to determine the partition number in the HPS database backends.