Lack of GPU Parallelism for Real-Time Server Using Faster-Whisper #1192
Comments
Hello, I wrote a very simple FastAPI script to run the faster-whisper module. The script is available at https://github.com/heimoshuiyu/whisper-fastapi and I use Docker to deploy my service. I have 4 RTX 4070 Ti Super GPUs, and I deploy 2 services on each GPU, so I have 8 services in total. Each service is mapped to one client, and the Grafana GPU monitor I set up indicates that all GPUs are utilized at 100%. I am using the large-v2 model, and almost all of my transcription tasks involve audio longer than 1 to 3 minutes. I don't think the feature-extraction preprocessing step wastes much GPU time. In my case, two services per GPU is enough. You might consider running more services per GPU if you are transcribing shorter audio.
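The layout described in this comment (4 GPUs, 2 services per GPU, 8 services in total) can be sketched as a small planning helper. This is only an illustration and is not taken from the linked repository: the `plan_services` function and the port scheme are hypothetical, and it assumes each service is pinned to one GPU through the `CUDA_VISIBLE_DEVICES` environment variable.

```python
def plan_services(num_gpus, services_per_gpu, base_port=8000):
    """Return one launch plan per service, packing services onto GPUs in order."""
    plans = []
    for service_id in range(num_gpus * services_per_gpu):
        # Integer division puts services 0-1 on GPU 0, 2-3 on GPU 1, and so on.
        gpu_index = service_id // services_per_gpu
        plans.append({
            "service_id": service_id,
            "port": base_port + service_id,  # hypothetical port scheme
            # Assumption: the service only sees the GPU named here.
            "env": {"CUDA_VISIBLE_DEVICES": str(gpu_index)},
        })
    return plans


if __name__ == "__main__":
    for plan in plan_services(num_gpus=4, services_per_gpu=2):
        print(plan)
```

Each plan could then be turned into one container or process launch, with the `env` entry passed through to the service.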
Thank you very much for the response. I will take a look at your solution as soon as I can.
@heimoshuiyu Unfortunately, that script is not utilizing the GPUs correctly, even if it shows 100% utilization.
Hi @MahmoudAshraf97, sorry to bother you. Do you have any suggestions on how to efficiently transcribe multiple audio sources in parallel using Faster Whisper? I’d appreciate any insights or recommendations.
Hi, I'm currently working on my thesis, which involves building a real-time transcription server using the `whisper-streaming` project and `faster-whisper` for the ASR backend. The server is deployed on an RTX 6000 Ada GPU, but I am struggling to achieve proper GPU parallelism.

I am relatively new to using Whisper and have only recently started using Python. I appreciate your patience and any guidance you can provide!
I Tried

Multiple Models on Multiple Threads: I created multiple `WhisperModel` instances (one per thread) and assigned each client to its own model. While this approach works for a few clients, performance degrades significantly beyond ~8 clients, regardless of the model size. From what I can observe, the models seem to be competing with each other for the entire GPU's resources.

Single Shared Model with `num_workers`: I shared a single `WhisperModel` instance among multiple threads and used the `num_workers` parameter to enable concurrent processing. This approach also works well initially but similarly fails to handle more than ~8 clients effectively, with the same issues.

[…] `faster-whisper`?

Does the `num_workers` parameter have any impact on GPU-based inference, or is it exclusively for CPU execution?

Any advice or clarification would be greatly appreciated. Thank you for your amazing work on this project!
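For reference, the second setup described above (one shared `WhisperModel` called from several threads with `num_workers` set) might look roughly like the sketch below. It is not the poster's actual code: the model size, compute type, worker count, and the helper names `load_model`, `transcribe_file`, and `transcribe_many` are all assumptions made for illustration. The thread pool is capped at the same number as `num_workers` so that no more calls are in flight than the model was configured for.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 4  # assumed value; each extra worker costs additional GPU memory


def load_model():
    # Imported lazily so the dispatch logic below can be exercised without a GPU.
    from faster_whisper import WhisperModel

    # Assumed settings: "large-v2" in float16 on a single CUDA device.
    return WhisperModel(
        "large-v2",
        device="cuda",
        compute_type="float16",
        num_workers=NUM_WORKERS,
    )


def transcribe_file(model, path):
    segments, _info = model.transcribe(path)
    # `segments` is a generator: decoding only happens while it is consumed.
    return " ".join(segment.text.strip() for segment in segments)


def transcribe_many(model, paths, max_threads=NUM_WORKERS):
    # One shared model, several threads; results come back in input order.
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(lambda p: transcribe_file(model, p), paths))
```

A call such as `transcribe_many(load_model(), ["a.wav", "b.wav"])` would transcribe both (hypothetical) files through the same model instance. Whether this scales past a handful of clients on one GPU is exactly the open question in this issue, so the sketch only pins down the setup being discussed.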