Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for whisper.cpp #17

Closed
versae opened this issue Nov 30, 2023 · 22 comments
Closed

Support for whisper.cpp #17

versae opened this issue Nov 30, 2023 · 22 comments

Comments

@versae
Copy link

versae commented Nov 30, 2023

Any chance of adding support for whisper.cpp? I know whisper.cpp is still stuck with the GGML format instead of GGUF, but it would be great to have portable whisper binaries that just work.

@benwilcock
Copy link

I agree, Whisper is awesome. I used to use it at the command line (which was very slow). I now use this project, and the inference times are 5-10x faster: https://github.com/jhj0517/Whisper-WebUI

@asmith26
Copy link

asmith26 commented Dec 8, 2023

I would also love speech input support for this, not only because it would be a really cool feature, but also because I sometimes get a bit of RSI so anything to help reduce the amount of typing needed is very helpful.

@Kreijstal
Copy link

yeah this would be nice to have in the llama server, llamafile was the only way I have figured out to run this things I've been hearing about for a year!

@ingenieroariel
Copy link

Speech input is a big feature in my use case, I do it now with GPT-4 on iPhone but doing the same with llamafile's server would be fantastic. What are the main blockers?

@smrl
Copy link

smrl commented Dec 13, 2023

Please, very interested in this use-case!

@flatsiedatsie
Copy link

Devil's advocate: it's not very difficult to run whisper separately and pipe any recognised sentences into Llamafile? I'm literally doing that right now, for example. It's also relatively easy to do in the browser.

What would be the benefit? Would this integration allow the LLM to start processing detected words earlier?

@jart
Copy link
Collaborator

jart commented Jan 24, 2024

It would have the same benefit that llamafile does. You wouldn't have to compile the software yourself.

@jart jart mentioned this issue Feb 19, 2024
@AmgadHasan
Copy link

Hi.

Is there any update regarding this request?

@jart
Copy link
Collaborator

jart commented Feb 19, 2024

I was able to build whisper.cpp using cosmocc with very few modifications.

diff --git a/Makefile b/Makefile
index 93c89cd..b3a89d7 100644
--- a/Makefile
+++ b/Makefile
@@ -39,7 +39,7 @@ endif
 #

 CFLAGS   = -I.              -O3 -DNDEBUG -std=c11   -fPIC
-CXXFLAGS = -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC
+CXXFLAGS = -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -fexceptions
 LDFLAGS  =

 ifdef MACOSX_DEPLOYMENT_TARGET
@@ -134,38 +134,38 @@ ifeq ($(UNAME_M),$(filter $(UNAME_M),x86_64 i686 amd64))
        ifdef CPUINFO_CMD
                AVX_M := $(shell $(CPUINFO_CMD) | grep -iwE 'AVX|AVX1.0')
                ifneq (,$(AVX_M))
-                       CFLAGS   += -mavx
-                       CXXFLAGS += -mavx
+                       CFLAGS   += -Xx86_64-mavx
+                       CXXFLAGS += -Xx86_64-mavx
                endif

                AVX2_M := $(shell $(CPUINFO_CMD) | grep -iw 'AVX2')
                ifneq (,$(AVX2_M))
-                       CFLAGS   += -mavx2
-                       CXXFLAGS += -mavx2
+                       CFLAGS   += -Xx86_64-mavx2
+                       CXXFLAGS += -Xx86_64-mavx2
                endif

                FMA_M := $(shell $(CPUINFO_CMD) | grep -iw 'FMA')
                ifneq (,$(FMA_M))
-                       CFLAGS   += -mfma
-                       CXXFLAGS += -mfma
+                       CFLAGS   += -Xx86_64-mfma
+                       CXXFLAGS += -Xx86_64-mfma
                endif

                F16C_M := $(shell $(CPUINFO_CMD) | grep -iw 'F16C')
                ifneq (,$(F16C_M))
-                       CFLAGS   += -mf16c
-                       CXXFLAGS += -mf16c
+                       CFLAGS   += -Xx86_64-mf16c
+                       CXXFLAGS += -Xx86_64-mf16c
                endif

                SSE3_M := $(shell $(CPUINFO_CMD) | grep -iwE 'PNI|SSE3')
                ifneq (,$(SSE3_M))
-                       CFLAGS   += -msse3
-                       CXXFLAGS += -msse3
+                       CFLAGS   += -Xx86_64-msse3
+                       CXXFLAGS += -Xx86_64-msse3
                endif

                SSSE3_M := $(shell $(CPUINFO_CMD) | grep -iw 'SSSE3')
                ifneq (,$(SSSE3_M))
-                       CFLAGS   += -mssse3
-                       CXXFLAGS += -mssse3
+                       CFLAGS   += -Xx86_64-mssse3
+                       CXXFLAGS += -Xx86_64-mssse3
                endif
        endif
 endif
diff --git a/ggml.c b/ggml.c
index 4ee2c5e..521eafe 100644
--- a/ggml.c
+++ b/ggml.c
@@ -24,7 +24,7 @@
 #include <stdarg.h>
 #include <signal.h>
 #if defined(__gnu_linux__)
-#include <syscall.h>
+#include <sys/syscall.h>
 #endif

 #ifdef GGML_USE_METAL
@@ -2069,6 +2069,8 @@ void ggml_numa_init(enum ggml_numa_strategy numa_flag) {
     int getcpu_ret = 0;
 #if __GLIBC__ > 2 || (__GLIBC__ == 2 && __GLIBC_MINOR__ > 28)
     getcpu_ret = getcpu(&current_cpu, &g_state.numa.current_node);
+#elif defined(__COSMOPOLITAN__)
+    current_cpu = sched_getcpu(), getcpu_ret = 0;
 #else
     // old glibc doesn't have a wrapper for this call. Fall back on direct syscall
     getcpu_ret = syscall(SYS_getcpu,&current_cpu,&g_state.numa.current_node);

I made a couple changes to cosmopolitan upstream that'll be incorporated in the next release for making it easier to build. More work would need to be done to do it as well as llamafile packages llama.cpp. But until then, you have this:

whisperfile.gz

@versae
Copy link
Author

versae commented Feb 20, 2024

Wow, thanks @jart! That's amazing!
Just confirming that it works like a charm :D

$ whisperfile -m ggml-model-q5_0.bin samples/jfk.wav 
whisper_init_from_file_with_params_no_state: loading model from 'ggml-model-q5_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 8
whisper_model_load: qntvr         = 2
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
whisper_model_load:      CPU total size =  1080.47 MB
whisper_model_load: model size    = 1080.47 MB
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   36.26 MB
whisper_init_state: compute buffer (encode) =  926.66 MB
whisper_init_state: compute buffer (cross)  =    9.38 MB
whisper_init_state: compute buffer (decode) =  209.26 MB

system_info: n_threads = 4 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.060 --> 00:00:07.500]   And so, my dear Americans, do not ask what your country can do for you.
[00:00:07.500 --> 00:00:11.000]   Ask what you can do for your country.


whisper_print_timings:     load time =  1281.10 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    41.95 ms
whisper_print_timings:   sample time =   102.85 ms /   159 runs (    0.65 ms per run)
whisper_print_timings:   encode time = 29479.98 ms /     1 runs (29479.98 ms per run)
whisper_print_timings:   decode time =    38.76 ms /     1 runs (   38.76 ms per run)
whisper_print_timings:   batchd time =  3710.61 ms /   156 runs (   23.79 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time = 34662.24 ms

Any instructions on how to package it together with a GGML model? Thanks again!

@versae
Copy link
Author

versae commented Feb 20, 2024

I just tried to compile it myself, to see it I was able to also get the stream binary to work. But after I apply the patch, the make command errors out with exponent has no digits:

whisper.cpp:2575:27: error: exponent has no digits
 2575 |         double theta = (2*M_PI*i)/SIN_COS_N_COUNT;
      |                           ^~~~
whisper.cpp:2672:42: error: exponent has no digits
 2672 |         output[i] = 0.5*(1.0 - cosf((2.0*M_PI*i)/(length + offset)));
      |                                          ^~~~

I run it like this:

$ make CC=bin/cosmocc CXX=bin/cosmoc++ stream

@jart
Copy link
Collaborator

jart commented Feb 20, 2024

@versae in your cosmocc toolchain just change include/libc/math.h to use the non-hex constants instead:

#define M_E        2.7182818284590452354  /* 𝑒 */
#define M_LOG2E    1.4426950408889634074  /* log₂𝑒 */
#define M_LOG10E   0.43429448190325182765 /* log₁₀𝑒 */
#define M_LN2      0.69314718055994530942 /* logₑ2 */
#define M_LN10     2.30258509299404568402 /* logₑ10 */
#define M_PI       3.14159265358979323846 /* pi */
#define M_PI_2     1.57079632679489661923 /* pi/2 */
#define M_PI_4     0.78539816339744830962 /* pi/4 */
#define M_1_PI     0.31830988618379067154 /* 1/pi */
#define M_2_PI     0.63661977236758134308 /* 2/pi */
#define M_2_SQRTPI 1.12837916709551257390 /* 2/sqrt(pi) */
#define M_SQRT2    1.41421356237309504880 /* sqrt(2) */
#define M_SQRT1_2  0.70710678118654752440 /* 1/sqrt(2) */

This will ship in the next cosmocc release.

@versae
Copy link
Author

versae commented Feb 20, 2024

After some tweaking, I was able to compile my own cosmocc and then use it to compile main, quantize and even server 🎉 . However, for stream there seems to be some issue with SDL2 library.

/usr/include/SDL2/SDL_config.h:4:10: fatal error: SDL2/_real_SDL_config.h: No such file or directory
    4 | #include <SDL2/_real_SDL_config.h>
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
make: *** [Makefile:402: stream] Error 1

I'll keep investigating as this could easily be some rookie mistake on my side.

Would it make sense to create a whisperllama repo for this?

@AmgadHasan
Copy link

After some tweaking, I was able to compile my own cosmocc and then use it to compile main, quantize and even server 🎉 . However, for stream there seems to be some issue with SDL2 library.

/usr/include/SDL2/SDL_config.h:4:10: fatal error: SDL2/_real_SDL_config.h: No such file or directory
    4 | #include <SDL2/_real_SDL_config.h>
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
make: *** [Makefile:402: stream] Error 1

I'll keep investigating as this could easily be some rookie mistake on my side.

Would it make sense to create a whisperllama repo for this?

I would suggest calling it Whisperfile :)

@develperbayman
Copy link

develperbayman commented Apr 1, 2024

so i have been using a more generic method to get voice in/out it works flawless my issue has been getting the model to load at a decent speed i have many variation of this code idk if this one is broken or not but i know the voice in/out works flawlessly edit: also its in python so not sure if it helps but even if it does not sometimes simplicity is best text to speech can go a long way

import speech_recognition as sr
from llama_cpp import Llama
import pyttsx3
from pydub import AudioSegment
import simpleaudio
from transformers import AutoModel
import os


# Load GGUF model efficiently using llama-cpp
model = AutoModel.from_pretrained("sonu2023/Mistral-7B-Vatax-v1-q8_0-GUFF")

recognizer = sr.Recognizer()
chatbot_busy = False

engine = pyttsx3.init()


def play_activation_sound():
    # Replace './computer.wav' with the path to your activation sound in WAV format
    activation_sound = AudioSegment.from_file('./computer.wav')
    simpleaudio.play_buffer(activation_sound.raw_data, num_channels=activation_sound.channels, bytes_per_sample=activation_sound.sample_width, sample_rate=activation_sound.frame_rate)


def chatbot_response(user_input):
    global chatbot_busy
    response = ""

    if user_input and not chatbot_busy:
        print("User:", user_input)

        # Generate response using llama-cpp
        prompt = f"[USER]: {user_input}\n[BOT]: "  # Use a more explicit prompt format
        response = llm.create_chat_completion(prompt=prompt)["messages"][-1]["content"]

        print("Chatbot:", response)
        chatbot_busy = False

        # Text-to-speech with pyttsx3
        text_to_speech(response)


def text_to_speech(text):
    # Save the synthesized speech to a temporary WAV file
    engine.save_to_file(text, 'output.wav')
    engine.runAndWait()

    # Play the temporary WAV file
    synthesized_sound = AudioSegment.from_file('output.wav')
    simpleaudio.play_buffer(synthesized_sound.raw_data, num_channels=synthesized_sound.channels, bytes_per_sample=synthesized_sound.sample_width, sample_rate=synthesized_sound.frame_rate)

    # Remove the temporary WAV file
    os.remove('output.wav')


def listen_for_input():
    global chatbot_busy

    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)

        while True:
            try:
                print("Listening...")
                audio_data = recognizer.listen(source)
                user_input = recognizer.recognize_google(audio_data).lower()
                print("User:", user_input)

                if 'computer' in user_input:
                    print("Chatbot activated. Speak now.")
                    play_activation_sound()

                    audio_data = recognizer.listen(source)
                    print("Listening...")
                    user_input = recognizer.recognize_google(audio_data).lower()

                    # Generate and respond using llama-cpp
                    chatbot_response(user_input)

            except sr.UnknownValueError:
                print("Could not understand audio. Please try again.")
            except Exception as e:
                print(f"An error occurred: {e}")


# Start listening for input
input_thread = threading.Thread(target=listen_for_input)
input_thread.start()

@cjpais
Copy link
Collaborator

cjpais commented May 16, 2024

@jart I am in progress on getting a version of whisper.cpp built with llamafile, specifically the server example

The executable itself is working and seems to be compiling properly for CUDA. However I would love some help with the file loading from within the zipaligned archive. If you could provide some guidance on what needs to be done in order to implement this portion that would be great.

I have replaced the std::ifstream opening with llamafile_open_gguf, however I am running into errors with this. I recognize maybe this function needs modification in order to load the whisper models which are not .gguf directly. Currently I get the warning warning: not a pkzip archive and it seems like it is trying to load the file from the local directory as opposed to from the zipaligned version. Not sure if I need to manipulate the filepath in some way or if this is handled with some utility function.

I am currently using the files llama.cpp and server.cpp as reference for what I should be doing, but would love any help if you know the implementation off the top of your head.

@jart
Copy link
Collaborator

jart commented May 16, 2024

If you've already discovered llamafile/llamafile.c then I'm not sure what other high level guidance I can offer you.

@cjpais
Copy link
Collaborator

cjpais commented May 18, 2024

Thanks @jart, that was all I needed. My C skills are a bit rusty so it was great to know I wasn't missing anything obvious, instead I was just forgetting some C basics

For the time being I've forked the llamafile into: https://github.com/cjpais/whisperfile

If it makes sense to integrate directly into llamafile, I am happy to clean up the code and submit a PR. If this is the case just let me know how you would like the dirs to be structured

@jgbrwn
Copy link

jgbrwn commented May 27, 2024

Why not just try to integrate/cosmopolitan-ize talk-llama into llamafile? Didn’t he already do all the heavy-lifting around this perhaps?

@cjpais
Copy link
Collaborator

cjpais commented May 27, 2024

it can also be done, probably fairly easy to do in the whisperfile repo. I needed server for a project I am doing so that was my primary focus. If there is enough interest I can port over talk-llama, happy to accept PR's as well

@Tamnac
Copy link

Tamnac commented Sep 6, 2024

I believe this issue can be closed now that we have whisperfile

@jart
Copy link
Collaborator

jart commented Sep 6, 2024

The whisperfile project @cjpais posted earlier is now been made an official Mozilla project. See the whisper.cpp/ folder of the llamafile codebase. Releases have been published to https://huggingface.co/Mozilla/whisperfile so enjoy everyone!

@jart jart closed this as completed Sep 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests