Are there any plans to integrate inference of the Canary model with TensorRT-LLM? #8899
-
cc: @galv
-
We are working on Parakeet in Triton Inference Server using the Python backend right now, but not Canary: https://github.com/NVIDIA/NeMo/pull/8673/files#diff-11846c15f5c57c285b422afe55fafd8d6daedc60fa814371c81a73da0c39aa22 I haven't posted any results, but the throughputs being achieved are very close to those of running transcribe_speech.py (about 1300 RTFx on an A100 on LibriSpeech test-other at batch size 32).

You may note that there is a Whisper implementation in TensorRT-LLM. The TensorRT-LLM team is working on continuous batching for encoder-decoder style models like Whisper, which in my view is required for optimal throughput; it isn't done yet. (Continuous batching matters less for Parakeet models, because their output sequence length distribution has less variance.) The central problem is that you need a large batch size to fully saturate your GPU, but at large batch sizes you will see higher variance in output sequence lengths, which causes wasted computation in non-continuous-batching implementations.

Anyway, the point is that this work on Whisper will carry over to Canary as well, because they share the same style of architecture, but I cannot give you any timeline on that.
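To make the wasted-computation point concrete, here is a small back-of-the-envelope sketch (illustrative only, not NeMo or TensorRT-LLM code; the output-length distributions are made up). With static batching, every slot in the batch keeps running decoder steps until the longest sequence finishes, so the higher the variance in output lengths, the more steps are spent on sequences that have already emitted EOS:

```python
# Back-of-the-envelope sketch: wasted decoder steps under static batching.
# Not NeMo or TensorRT-LLM code; the length distributions below are made up.
import numpy as np

rng = np.random.default_rng(0)

def wasted_fraction(output_lens: np.ndarray) -> float:
    """Fraction of decode steps spent on sequences that have already finished."""
    static_steps = len(output_lens) * output_lens.max()  # batch runs until the longest sequence ends
    needed_steps = output_lens.sum()                     # ideal, continuous-batching-like cost
    return 1.0 - needed_steps / static_steps

# Hypothetical output-length distributions at batch size 32:
low_var = rng.normal(loc=100, scale=5, size=32).clip(min=1).astype(int)    # low variance (Parakeet-like)
high_var = rng.normal(loc=100, scale=40, size=32).clip(min=1).astype(int)  # high variance (Whisper/Canary-like)

print(f"wasted steps, low-variance lengths:  {wasted_fraction(low_var):.1%}")
print(f"wasted steps, high-variance lengths: {wasted_fraction(high_var):.1%}")
```

The exact numbers depend on the real length distribution, but the high-variance case wastes a substantially larger share of decode compute, which is what continuous batching avoids by refilling finished slots with new requests.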
-
It would be nice to have an inference example for the Canary model with Triton, ideally an optimized version with an efficient backend implementation such as TensorRT-LLM.
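For reference, a Triton Python-backend model for Canary might look roughly like the sketch below. This is a hedged illustration only, not the implementation in the PR linked above: the tensor names ("AUDIO", "TRANSCRIPT"), the checkpoint id, and in-memory transcribe() support are assumptions, and it does none of the batching optimizations discussed above.

```python
# model.py for Triton's Python backend -- a hedged sketch, not the NeMo PR linked above.
# Tensor names, the checkpoint id, and in-memory transcribe() support are assumptions.
import numpy as np
import triton_python_backend_utils as pb_utils
from nemo.collections.asr.models import EncDecMultiTaskModel


class TritonPythonModel:
    def initialize(self, args):
        # Load the Canary checkpoint once when Triton loads the model.
        self.model = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b")
        self.model.eval()

    def execute(self, requests):
        responses = []
        for request in requests:
            # Expect a float32 waveform tensor named "AUDIO" (16 kHz mono).
            audio = pb_utils.get_input_tensor_by_name(request, "AUDIO").as_numpy()
            # Recent NeMo releases accept in-memory arrays; older ones expect file paths.
            hyp = self.model.transcribe([audio.squeeze()], batch_size=1)[0]
            text = hyp.text if hasattr(hyp, "text") else str(hyp)
            out = pb_utils.Tensor("TRANSCRIPT", np.array([text], dtype=object))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```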