
Add whisper.cpp (server) support to llamafile #517

Merged · 1 commit into Mozilla-Ocho:main · Jul 31, 2024

Conversation

@cjpais (Collaborator) commented Jul 30, 2024

This PR adds whisper.cpp support to llamafile, addressing #17 in part. Only the server binary has been ported in this PR.

Most of the work to support this was initially done on my fork of llamafile, whisperfile. This PR ports that code over and structures it similarly to the stable-diffusion.cpp support.

The whisper.cpp code was taken from commit 6739eb8.
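
For context, here is a minimal sketch of the whisper.cpp C API that the ported server wraps. This is not code from the PR; the model path is a placeholder and audio decoding is elided:

```cpp
#include <cstdio>
#include <vector>
#include "whisper.h"

int main() {
    // Load a whisper model; the path here is a placeholder.
    struct whisper_context_params cparams = whisper_context_default_params();
    struct whisper_context * ctx =
        whisper_init_from_file_with_params("models/ggml-base.en.bin", cparams);
    if (ctx == nullptr) return 1;

    // whisper.cpp expects 16 kHz mono float PCM; decoding the input
    // file into this buffer is elided here.
    std::vector<float> pcmf32;

    struct whisper_full_params wparams =
        whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    if (whisper_full(ctx, wparams, pcmf32.data(), (int) pcmf32.size()) == 0) {
        // Print the transcription segment by segment.
        for (int i = 0; i < whisper_full_n_segments(ctx); i++) {
            printf("%s", whisper_full_get_segment_text(ctx, i));
        }
        printf("\n");
    }

    whisper_free(ctx);
    return 0;
}
```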

@cjpais changed the title from "Add whisper.cpp in llamafile" to "Add whisper.cpp (server) support to llamafile" on Jul 30, 2024
@jart (Collaborator) left a comment

Great work getting the ball rolling on whisperfile @cjpais!

I'm happy to take over the maintenance burden for you.

@jart merged commit fd891be into Mozilla-Ocho:main on Jul 31, 2024 (2 checks passed)
@jart (Collaborator) commented Jul 31, 2024

Just pushed some improvements. Enjoy!

  • b3bdc62 Make whisperfile 12% faster in GPU mode
  • 0849f32 Get CUDA and Metal GPU working in whisperfile
  • 94e9629 Add CLI program to whisperfile

@jart (Collaborator) commented Jul 31, 2024

OK, the good news is that whisperfile works outstandingly well with the "large" model on CPU.

[screenshot: large-model transcription running on CPU]

However, I'm going to disable GPU support by default because it isn't working reliably. Here's CUDA:

[screenshot: CUDA output]

Here's Apple Metal:

[screenshot: Apple Metal output]

I'm not sure what's wrong. It's possible the problem will solve itself the next time I synchronize with llama.cpp upstream. I also checked and it definitely wasn't my performance optimization in the last change. The tiny whisper-tiny.en-q5_1.bin model does seem to work great on GPU (you can pass the --gpu auto flag to try it). So maybe the issue is related to how whisper-medium.en.bin and whisper-large-v3.bin are encoded. Any ideas?
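
For reference, GPU use in whisper.cpp is gated at context creation time. A sketch of what disabling it by default could look like follows; whether llamafile routes its --gpu flag through exactly this field is an assumption, and the model path is a placeholder:

```cpp
#include "whisper.h"

// whisper_context_params carries a use_gpu toggle; defaulting it to false
// matches disabling GPU support by default.
struct whisper_context_params cparams = whisper_context_default_params();
cparams.use_gpu = false;  // CPU path; set true when the user opts in (e.g. --gpu auto)
struct whisper_context * ctx =
    whisper_init_from_file_with_params("whisper-large-v3.bin", cparams);
```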

@cjpais (Collaborator, Author) commented Jul 31, 2024

Thank you @jart!! I must've missed porting some of the CUDA/Metal code from the whisperfile repo; I appreciate you taking such quick care of it. Love the performance improvements on CPU as well!

It's possible it could be something with the encoding; I've run into some issues with the bins before. Also, when I synced the whisper.cpp code I did not sanity-check it, so it may be an underlying issue there. If you don't get to it first, I'll take a closer look toward the end of the week or over the weekend.
