[Question] How to add new models to HPS configuration when using Model Control Mode EXPLICIT? #451
@dmac
Correct.
I see, thanks. Do you know if that feature is planned or has been discussed before? Or is this question better directed at the Triton/TensorRT issues page? Have you seen anyone use a dynamic, constantly changing HPS configuration like this before, either with the HPS backend, or with the TRT backend plus the HPS TRT plugin?
One more question: do models served by the Triton HPS backend need to be trained with the HugeCTR framework? Or is there a way to convert an existing ONNX model to a format that can be served by the HPS backend? (Currently we train a model with PyTorch, export to ONNX, then convert to TRT.)
Yes, there are users with such a requirement, so we support online updating and unloading of models in the HPS Triton backend. However, those users mainly use HPS as an independent embedding query service, for example building an HPS+TF/TRT inference pipeline with Triton's ensemble mode, so that they can independently control online updates of the embedding table in the HPS backend.
No, that is not necessary; the input format of HPS can be found here. If you need to convert a Torch embedding model to the HPS format, you can refer to the example.
Ok, it sounds like using an ensemble model, with the embeddings looked up from the HPS backend instead of via the HPS TRT plugin, is probably what we want. Thanks for the advice.
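For anyone landing on this thread later: a minimal sketch of what such an ensemble `config.pbtxt` might look like. All model names, tensor names, and dimensions here are hypothetical placeholders, not taken from this project; the pattern is a first step served by the HPS backend for the embedding lookup, whose output is mapped into a TRT-converted dense model.

```protobuf
# Hypothetical ensemble: HPS embedding lookup -> TRT dense model.
name: "ensemble_A.11"
platform: "ensemble"
max_batch_size: 64
input [
  { name: "KEYS", data_type: TYPE_INT64, dims: [ 26 ] },
  { name: "NUMERIC", data_type: TYPE_FP32, dims: [ 13 ] }
]
output [
  { name: "OUTPUT0", data_type: TYPE_FP32, dims: [ 1 ] }
]
ensemble_scheduling {
  step [
    {
      model_name: "hps_embedding_A.11"   # served by the HPS backend
      model_version: -1
      input_map { key: "KEYS", value: "KEYS" }
      output_map { key: "OUTPUT0", value: "EMBEDDINGS" }
    },
    {
      model_name: "dense_trt_A.11"       # the TRT-converted dense network
      model_version: -1
      input_map { key: "EMBEDDINGS", value: "EMBEDDINGS" }
      input_map { key: "NUMERIC", value: "NUMERIC" }
      output_map { key: "OUTPUT0", value: "OUTPUT0" }
    }
  ]
}
```

With this layout, the HPS step and the dense step can be loaded and unloaded independently under Model Control Mode EXPLICIT, which is what enables the embedding table to be updated without touching the dense model.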
My company runs a Triton fleet that serves multiple models which are continuously retrained and reloaded onto Triton using Model Control Mode EXPLICIT. These are organized into a few different model types, but our convention is that every instance of a model is given a unique name and we don't use the Triton notion of versions.
For example, we have model A training on some cadence and model B training on some cadence, and the latest version of each is loaded on Triton as soon as it's done training. A specific instance of a model is named like "A.1", and when the next iteration of that model is trained, it will be named "A.2", etc. We might also have multiple versions for each type of model loaded onto Triton to support clean transitions from one version of a model to the next. So, at a given moment in time, we might have the following models loaded: A.10, A.11, B.75.
In this example, say we introduce a brand new type of model called C. We also are able to retire model A.10 because it's no longer receiving requests. So at another moment in time we might have these models loaded: A.11, B.75, C.1.
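The rotation described above (load C.1, retire A.10) can be sketched against Triton's model repository API. This is a sketch under assumptions, not our production code: `plan_rotation` and `apply_rotation` are hypothetical helper names, and the server URL is a placeholder. The `load_model`/`unload_model` calls are from the `tritonclient` HTTP client, which maps onto Triton's `POST /v2/repository/models/<name>/(load|unload)` endpoints.

```python
from typing import Iterable, Set, Tuple


def plan_rotation(loaded: Iterable[str], desired: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (to_load, to_unload) so the server converges on `desired`."""
    loaded_set, desired_set = set(loaded), set(desired)
    return desired_set - loaded_set, loaded_set - desired_set


def apply_rotation(server_url: str, loaded: Iterable[str], desired: Iterable[str]) -> None:
    """Hypothetical helper: push the diff to a running Triton server
    started with --model-control-mode=explicit."""
    import tritonclient.http as httpclient  # requires tritonclient[http]

    client = httpclient.InferenceServerClient(server_url)
    to_load, to_unload = plan_rotation(loaded, desired)
    for name in to_load:
        client.load_model(name)    # POST /v2/repository/models/<name>/load
    for name in to_unload:
        client.unload_model(name)  # POST /v2/repository/models/<name>/unload
```

For example, `plan_rotation({"A.10", "A.11", "B.75"}, {"A.11", "B.75", "C.1"})` yields `({"C.1"}, {"A.10"})`. Loading new models before unloading old ones keeps at least one serving copy available during the transition.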
I'm investigating whether it's possible to use the HPS plugin for TensorRT with our existing architecture, without needing to restart Triton whenever our list of active models changes. My understanding is that when we start Triton, we supply a JSON configuration file like:
And this file contains the configuration for every model, like:
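(The original attachment is not preserved in this scrape. As a hedged illustration only, field names below follow my reading of the HugeCTR HPS configuration documentation, and every path and value is a placeholder; a per-model entry might look roughly like this.)

```json
{
  "supportlonglong": true,
  "models": [
    {
      "model": "A.11",
      "sparse_files": ["/models/A.11/embeddings.model"],
      "num_of_worker_buffer_in_pool": 3,
      "embedding_table_names": ["sparse_embedding1"],
      "embedding_vecsize_per_table": [16],
      "maxnum_catfeature_query_per_table_per_sample": [26],
      "default_value_for_each_table": [0.0],
      "deployed_device_list": [0],
      "max_batch_size": 64,
      "hit_rate_threshold": 0.9,
      "gpucacheper": 0.5,
      "gpucache": true
    }
  ]
}
```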
I'm able to explicitly load TRT models that use the HPS plugin, but if I update the JSON file in place and attempt to make inferences on a new model named C.1, Triton reports an error:
Is there a way to supply new model configurations once Triton is already running, and to remove configurations that are no longer relevant? I was optimistic after reading this, which mentions "adding the configuration of a new model to the HPS configuration file," but I'm not sure how to do that.
Thanks!