Core ML support #566

Merged
merged 5 commits into from
Apr 15, 2023

@ggerganov
Owner

ggerganov commented Mar 5, 2023

Running Whisper inference on Apple Neural Engine (ANE) via Core ML

This PR extends whisper.cpp to run the Whisper Encoder on the ANE through Core ML inference.
The performance gain is more than 3x compared to the 8-thread CPU for the tiny, base and small models.

Here are initial performance benchmarks for the Encoder on M1 Pro with (top) and without (bottom) Core ML:

CPU | OS | Config | Model | Th | Load [ms] | Encode [ms] | Commit
MacBook M1 Pro | MacOS 13.2.1 | CORE ML | tiny | 4 | 50 | 30 | b0ac915
MacBook M1 Pro | MacOS 13.2.1 | CORE ML | base | 4 | 74 | 64 | b0ac915
MacBook M1 Pro | MacOS 13.2.1 | CORE ML | small | 4 | 188 | 208 | b0ac915
MacBook M1 Pro | MacOS 13.2.1 | CORE ML | medium | 4 | 533 | 1033 | b0ac915
MacBook M1 Pro | MacOS 13.2.1 | CORE ML | large | 4 | ? | ? | b0ac915
---
MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | tiny | 8 | 71 | 102 | 206fc93
MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | base | 8 | 96 | 220 | 206fc93
MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | small | 8 | 233 | 685 | 206fc93
MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | medium | 8 | 603 | 1928 | 206fc93
MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | large | 8 | 1158 | 3350 | 206fc93
---
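
As a quick sanity check on the speed-up claim, the ratio can be computed directly from the Encode [ms] column above. This is only a sketch: the dictionary and function names are made up for illustration, and the two sets of runs use different thread counts (4 vs 8), OS versions and commits, so the ratios are approximate.

```python
# Encode times in ms, copied from the benchmark table above.
coreml_ms = {"tiny": 30, "base": 64, "small": 208, "medium": 1033}
cpu_ms = {"tiny": 102, "base": 220, "small": 685, "medium": 1928}


def speedup(model: str) -> float:
    """Return how many times faster the Core ML encoder is than the CPU one."""
    return cpu_ms[model] / coreml_ms[model]


for model in coreml_ms:
    print(f"{model:>6}: {speedup(model):.2f}x")
```

With these numbers the ratio is above 3 for tiny, base and small, and below 2 for medium.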

This PR adds a helper script models/generate-coreml-model.sh that can be used to easily generate a Core ML Encoder model yourself. For now, I don't plan on hosting the Core ML models as there is some chance that the implementation will change in the future. Therefore, it is recommended that everyone simply generate them locally with that script. See the instructions below.

There are a couple of drawbacks:

  • The first run of a Core ML model on a device takes a long time (several seconds, depending on the model).
    All follow-up runs are fast
  • The medium and large models take a long time to convert to Core ML (tens of minutes) and require a lot of RAM. The first run on a device is also very slow for them, so I am not sure these are viable for production use

Acknowledgements

Huge thanks to @wangchou for the initial demonstration of how to use Core ML in whisper.cpp (#548)

Thanks to @RobertRiachi for optimizing for ANE execution and improving the model export process

Thanks to everyone else who participated in #548 and helped with insights, testing and ideas

Usage

  • Install dependencies:

    pip install ane_transformers
    pip install openai-whisper
    pip install coremltools
  • Generate a Core ML model. For example, to generate a base.en model, use:

    ./models/generate-coreml-model.sh base.en

    This will generate the folder models/ggml-base.en-encoder.mlmodelc

  • Build whisper.cpp with Core ML support:

    # using Makefile
    make clean
    WHISPER_COREML=1 make -j
    
    # using CMake (configure into ./build, then compile)
    cmake -B build -DWHISPER_COREML=1
    cmake --build build -j --config Release
  • Run the examples as usual. For example:

    ./main -m models/ggml-base.en.bin -f samples/gb0.wav
    
    ...
    
    whisper_init_state: loading Core ML model from 'models/ggml-base.en-encoder.mlmodelc'
    whisper_init_state: first run on a device may take a while ...
    whisper_init_state: Core ML model loaded
    
    system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 1 | 
    
    ...

    The first run on a device is slow, since the ANE service compiles the Core ML model to some device-specific format.
    Subsequent runs are faster.
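
Judging only by the log lines above, the Core ML encoder folder appears to be looked up next to the ggml model, with the `.bin` extension replaced by `-encoder.mlmodelc`. The sketch below mirrors that naming rule so you can check which folder will be picked up; the function name is hypothetical and the rule is inferred from the log output, not taken from the whisper.cpp sources.

```python
from pathlib import Path


def coreml_encoder_path(ggml_model_path: str) -> Path:
    """Guess the Core ML encoder folder that sits next to a ggml model file."""
    model = Path(ggml_model_path)
    # Path.stem drops only the last extension: "ggml-base.en.bin" -> "ggml-base.en".
    return model.with_name(model.stem + "-encoder.mlmodelc")


print(coreml_encoder_path("models/ggml-base.en.bin").as_posix())
```

For `models/ggml-base.en.bin` this yields `models/ggml-base.en-encoder.mlmodelc`, matching the path in the log.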

TODO

  • Can the Decoder be ported to ANE too? (see #548, "Run encoder on Apple Neural Engine", reply in thread)
    Answer: Yes, but it is slow
  • Convert the medium and large models to Core ML format and upload to HF
    Needs an Apple Silicon Mac with 64 GB RAM to do the conversion from PyTorch -> Core ML
    Does not seem viable - too slow
  • Unified ggml + coreml model file
    We currently load both the full ggml model (encoder + decoder) and the coreml encoder - not optimal
    Will be done in the future, hopefully via community contributions
  • Scripts for generating Core ML model files (e.g. https://github.com/wangchou/callCoreMLFromCpp)
  • Support loading Core ML model from memory buffer
    Currently we support only loading from a folder on the disk
    Low-prio, hoping for contributions
  • Progress report for initial-run model processing
    Does not look possible. Any CoreML experts?
  • Adjust memory usage buffers when using Core ML
    Not needed - the Encoder compute buffer is less than 20MB even for the large model
  • Try to avoid the first on-device automatic model generation (it takes a long time)
    Does not look possible. Any CoreML experts?
  • The medium model takes more than 30 minutes to convert on the first run. Is there a work-around?
    I think no
  • Can we run the Core ML inference on the GPU?
    Looks like not worth it

Future work

  • Fix the ANE-optimized Whisper implementation. Currently, there is something wrong with the tensor shapes when they are passed to / from whisper.cpp and the transcription gets corrupted. The optimized version should be about 1.5x faster than the original one
  • Add support for decoder-only ggml models. This will avoid having to store the Encoder data twice on disk / in memory. Currently, it is stored once in the ggml model and once more in the Core ML model. This will reduce both disk and memory usage
  • Add support for running the Decoder on the ANE. Due to the nature of the Decoder operations, it seems that running on the CPU is generally more efficient in terms of speed compared to running it on the ANE. However, an ANE Decoder should be much more energy-efficient compared to the CPU one, so having this option could be useful in some cases

@brozkrut

brozkrut commented Mar 6, 2023

Great work!

I tested the coreml branch on a Mac Mini M2 (base $599 model).

The performance gain seems to be more than 5x compared to the 4-thread CPU, thanks to the much faster ANE on the M2. (On the base Mac Mini M2, 8 CPU threads are slower than 4.)

Performance benchmarks for the Encoder with (top) and without (bottom) Core ML:

CPU | OS | Config | Model | Th | Load | Enc. | Commit
Mac Mini M2 | macOS 13.2.1 | CORE ML | tiny | 4 | 44 | 25 | 17a1459
Mac Mini M2 | macOS 13.2.1 | CORE ML | base | 4 | 66 | 54 | 17a1459
Mac Mini M2 | macOS 13.2.1 | CORE ML | small | 4 | 163 | 190 | 17a1459
Mac Mini M2 | macOS 13.2.1 | CORE ML | medium | 4 | | | 17a1459
Mac Mini M2 | macOS 13.2.1 | CORE ML | large | 4 | | | 17a1459

CPU | OS | Config | Model | Th | Load | Enc. | Commit
Mac Mini M2 | macOS 13.2.1 | NEON BLAS | tiny | 4 | 40 | 142 | 59fdcd1
Mac Mini M2 | macOS 13.2.1 | NEON BLAS | base | 4 | 67 | 299 | 59fdcd1
Mac Mini M2 | macOS 13.2.1 | NEON BLAS | small | 4 | 152 | 980 | 59fdcd1
Mac Mini M2 | macOS 13.2.1 | NEON BLAS | medium | 4 | | | 59fdcd1
Mac Mini M2 | macOS 13.2.1 | NEON BLAS | large | 4 | | | 59fdcd1

@DontEatOreo

I compiled whisper.cpp with Core ML support using make and also built the mlmodel, but I'm getting an error:

whisper_init_from_file: loading model from 'models/ggml-base.en.mlmodelc'
whisper_model_load: loading model
whisper_model_load: invalid model data (bad magic)
whisper_init: failed to load model
error: failed to initialize whisper context

Is there anything else I'm missing? 🤔

@ggerganov
Owner Author

@DontEatOreo

On the command line, you still have to specify the non-coreml model: models/ggml-base.en.bin.
The code will automatically also load the models/ggml-base.en.mlmodelc if it is present in the same folder.
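
The "bad magic" error above is what happens when the Core ML folder is passed where a ggml file is expected. A minimal pre-flight check might look like the sketch below. It assumes that a ggml model file starts with the 32-bit magic 0x67676d6c ("ggml") stored little-endian; that constant is an assumption that should be verified against the ggml version in use, and the function name is hypothetical.

```python
import os
import struct

GGML_FILE_MAGIC = 0x67676D6C  # assumed value: "ggml" as a 32-bit integer


def looks_like_ggml_model(path: str) -> bool:
    """Return True if `path` is a regular file that starts with the ggml magic."""
    if not os.path.isfile(path):  # a .mlmodelc model is a folder, not a file
        return False
    with open(path, "rb") as f:
        header = f.read(4)
    if len(header) < 4:
        return False
    (magic,) = struct.unpack("<I", header)
    return magic == GGML_FILE_MAGIC
```

Calling it on the value passed to `-m` before starting a run would catch the mix-up early.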

@DontEatOreo

@ggerganov Thank you! I was very confused about why it wasn't working even though I did everything right

@dennislysenko

This is great. Excited to see how this feature develops. Leveraging the ANE would be huge, even more so if the decoder could be ported to it.

@strangelearning

Just saw this was announced, is it useful? https://github.com/apple/ml-ane-transformers

@cerupcat

cerupcat commented Apr 5, 2023

@DontEatOreo

On the command line, you still have to specify the non-coreml model: models/ggml-base.en.bin. The code will automatically also load the models/ggml-base.en.mlmodelc if it is present in the same folder.

Does this mean we have to bundle both files with the app? Asking since the file size gets fairly large having to include them all.

@lucabeetz

Hey, thanks for this awesome project! I am trying to run the whisper.objc example with CoreML but running into some issues. Has someone successfully done this and could guide me on how to set it up?

@ggerganov
Owner Author

@DontEatOreo
On the command line, you still have to specify the non-coreml model: models/ggml-base.en.bin. The code will automatically also load the models/ggml-base.en.mlmodelc if it is present in the same folder.

Does this mean we have to bundle both files with the app? Asking since the file size gets fairly large having to include them all.

The solution is to produce an encoder-only Core ML model in one file and a decoder-only standard model in another file. This is not very difficult to achieve, but supporting so many model files might become too much for me to maintain. So I will probably rely on someone helping out and demonstrating how this can be done, either as an example in this repo or in a fork.

@ggerganov ggerganov marked this pull request as ready for review April 14, 2023 19:27
@ggerganov
Owner Author

This is getting almost ready to merge. I am hoping to do it tomorrow.

The most important part that currently needs testing is the creation of the CoreML models, following the instructions here:

#548 (reply in thread)

If you give this a try, please let us know the results and if you encountered any issues.
Also, let us know whether you used quantized or non-quantized Core ML models and what your experience was.

I believe that tiny, base and small models should be supported, while medium and large seem to not be viable for this approach.

@aehlke

aehlke commented Apr 14, 2023

1.4 GB for medium sounds fine for users, but are you saying there are other limitations that rule it out?

@ggerganov
Owner Author

@aehlke The scripts for generating Core ML models support all sizes, but on my M1 Pro it takes a very long time (more than half an hour) to generate the medium model. After that, the first run is also very slow. Subsequent runs are about 2 times faster compared to CPU-only.

In any case, you can follow the instructions in this PR and see how it works on your device.

@ggerganov ggerganov merged commit 5e47e22 into master Apr 15, 2023
@ggerganov ggerganov deleted the coreml branch April 15, 2023 10:21
@neurostar

neurostar commented Apr 15, 2023

CPU | OS | Config | Model | Th | Load | Enc. | Commit
MacBook Air M2 | MacOS 13.3.1 | NEON BLAS COREML | tiny | 4 | 41 | 31 | f19e23f
MacBook Air M2 | MacOS 13.3.1 | NEON BLAS COREML | base | 4 | 59 | 57 | f19e23f
MacBook Air M2 | MacOS 13.3.1 | NEON BLAS COREML | small | 4 | 147 | 195 | f19e23f
MacBook Air M2 | MacOS 13.3.1 | NEON BLAS COREML | medium | 4 | 576 | 783 | f19e23f
MacBook Air M2 | MacOS 13.3.1 | NEON BLAS COREML | large | 4 | 1196 | 2551 | f19e23f

Great work!
Converting the large model to a Core ML model consumed ~9.7 GB of memory (with a short peak of 15.03 GB), and it worked fine on an 8 GB Air.

Edit:
I measured the Core ML model conversion time and the first-load conversion time (the difference between the first and second load times).

Model | COREML conv | First Loading conv (sec)
tiny | 4.915 | 0.72
base | 8.564 | 1.34
small | 26.050 | 4.72
medium | 1:35.85 | 15.57
large | 3:43.32 | 35.10
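
The conversion column mixes plain seconds with what looks like a minutes:seconds format. To compare the rows on one scale, the values can be normalized as in the sketch below; it assumes that "1:35.85" means 1 minute 35.85 seconds, and the helper name is made up for illustration.

```python
def to_seconds(value: str) -> float:
    """Convert 'SS.sss' or 'M:SS.ss' (as printed in the table) to seconds."""
    seconds = 0.0
    for part in value.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


# Core ML conversion times, copied from the table above.
conversion = {"tiny": "4.915", "base": "8.564", "small": "26.050",
              "medium": "1:35.85", "large": "3:43.32"}

for model, raw in conversion.items():
    print(f"{model:>6}: {to_seconds(raw):.2f} s")
```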

@CarberryChai

When running this script:

./models/generate-coreml-model.sh base.en

I got the error:

xcrun: error: unable to find utility "coremlc", not a developer tool or in PATH
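
The thread does not record a resolution for this error. One thing worth checking is whether `xcrun` can locate `coremlc` at all; a possible cause (an assumption, not confirmed here) is that the active developer directory points at the Command Line Tools rather than a full Xcode install. A minimal sketch of such a check, with a hypothetical helper name:

```python
import subprocess
from typing import Optional


def find_coremlc() -> Optional[str]:
    """Return the path xcrun reports for coremlc, or None if it is not found."""
    try:
        result = subprocess.run(
            ["xcrun", "--find", "coremlc"],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:  # xcrun itself is missing (e.g. not on macOS)
        return None
    path = result.stdout.strip()
    return path if result.returncode == 0 and path else None


print(find_coremlc() or "coremlc not found")
```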

@flexchar

Is it just me, or is the link to the Core ML models missing on Hugging Face?

Btw, @ggerganov, if you need help converting the models, I'd be glad to contribute. It seems to me that it only needs to be done once. :)

@ggerganov
Owner Author

For now, you should generate the Core ML models locally following the instructions.
I don't want to host them on HF yet, because it is very likely that the models will change soon: there are some pending improvements (see #548 (reply in thread)). If I upload them now, we will later get new models and everyone will be confused about which model they are using, etc.

@flexchar

In that regard, I'd like to ask for help, since I can't seem to get it to work:

python3.10 ./models/convert-whisper-to-coreml.py --model tiny

100%|█████████████████████████████████████| 72.1M/72.1M [00:05<00:00, 14.3MiB/s]
ModelDimensions(n_mels=80, n_audio_ctx=1500, n_audio_state=384, n_audio_head=6, n_audio_layer=4, n_vocab=51865, n_text_ctx=448, n_text_state=384, n_text_head=6, n_text_layer=4)
/opt/homebrew/lib/python3.10/site-packages/whisper/model.py:166: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert x.shape[1:] == self.positional_embedding.shape, "incorrect audio shape"
/opt/homebrew/lib/python3.10/site-packages/whisper/model.py:97: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  scale = (n_state // self.n_head) ** -0.25
Converting PyTorch Frontend ==> MIL Ops: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████▋| 367/368 [00:00<00:00, 6681.50 ops/s]
Running MIL frontend_pytorch pipeline: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 1047.63 passes/s]
Running MIL default pipeline: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 57/57 [00:00<00:00, 147.77 passes/s]
Running MIL backend_mlprogram pipeline: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 2599.51 passes/s]
Traceback (most recent call last):
  File "/Users/luke/dev/whisper.cpp/./models/convert-whisper-to-coreml.py", line 331, in <module>
    decoder = convert_decoder(hparams, decoder, quantize=args.quantize)
  File "/Users/luke/dev/whisper.cpp/./models/convert-whisper-to-coreml.py", line 283, in convert_decoder
    traced_model = torch.jit.trace(model, (token_data, audio_data))
  File "/opt/homebrew/lib/python3.10/site-packages/torch/jit/_trace.py", line 741, in trace
    return trace_module(
  File "/opt/homebrew/lib/python3.10/site-packages/torch/jit/_trace.py", line 958, in trace_module
    module._c._create_method_from_trace(
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1098, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/whisper/model.py", line 211, in forward
    x = block(x, xa, mask=self.mask, kv_cache=kv_cache)
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1098, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/whisper/model.py", line 138, in forward
    x = x + self.cross_attn(self.cross_attn_ln(x), xa, kv_cache=kv_cache)[0]
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1098, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/whisper/model.py", line 83, in forward
    k = self.key(x if xa is None else xa)
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1098, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/whisper/model.py", line 37, in forward
    return F.linear(
RuntimeError: mat1 and mat2 shapes cannot be multiplied (384x1500 and 384x384)
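
The thread does not identify the root cause of this failure. The error itself only states the matrix-product rule: the column count of the first operand must equal the row count of the second. The sketch below reproduces that rule with the shapes from the message, in plain Python rather than PyTorch; the function name is hypothetical, and the remark about swapped axes is a reading of the numbers, not a confirmed diagnosis.

```python
def matmul_shape(a: tuple, b: tuple) -> tuple:
    """Return the result shape of a 2-D matrix product, or raise on mismatch."""
    (a_rows, a_cols), (b_rows, b_cols) = a, b
    if a_cols != b_rows:
        raise ValueError(
            f"mat1 and mat2 shapes cannot be multiplied "
            f"({a_rows}x{a_cols} and {b_rows}x{b_cols})"
        )
    return (a_rows, b_cols)


# Shapes from the error above: the first operand arrives as 384x1500.
try:
    matmul_shape((384, 1500), (384, 384))
except ValueError as err:
    print(err)

# With the two axes of the first operand swapped, the product is well defined.
print(matmul_shape((1500, 384), (384, 384)))
```

The numbers match n_audio_state=384 and n_audio_ctx=1500 from the ModelDimensions line, which suggests the audio features reach the decoder with their two axes in the opposite order from what it expects.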

@ganqqwerty

The following gets stuck forever on an M1 with 64 GB of RAM. I waited for 12 hours but got no further output. macOS 13.5 (22G74).

whisper_init_state: loading Core ML model from 'models/ggml-large-encoder.mlmodelc'
whisper_init_state: first run on a device may take a while ...

@artemgordinskiy

artemgordinskiy commented Sep 2, 2023

I finally managed to get it to work on the "beta" (v1.4.2), with the same HW and OS as @ganqqwerty:

  1. Built with WHISPER_COREML=1 make -j.
  2. Downloaded the large CoreML model from Huggingface
  3. Ran a sample overnight (~11 hours):
~/D/whisper.cpp ❯❯❯ ./main -m models/ggml-large.bin -f samples/jfk.wav
whisper_init_from_file_no_state: loading model from 'models/ggml-large.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5
whisper_model_load: mem required  = 3557.00 MB (+   71.00 MB per decoder)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     = 2951.27 MB
whisper_model_load: model size    = 2950.66 MB
whisper_init_state: kv self size  =   70.00 MB
whisper_init_state: kv cross size =  234.38 MB
whisper_init_state: loading Core ML model from 'models/ggml-large-encoder.mlmodelc'
whisper_init_state: first run on a device may take a while ...
whisper_init_state: Core ML model loaded

system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 1 |

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =   985.39 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    35.84 ms
whisper_print_timings:   sample time =    11.56 ms /    27 runs (    0.43 ms per run)
whisper_print_timings:   encode time =  3036.61 ms /     1 runs ( 3036.61 ms per run)
whisper_print_timings:   decode time =   794.28 ms /    27 runs (   29.42 ms per run)
whisper_print_timings:    total time = 40924196.00 ms
  4. Subsequent runs now go much faster, with the model loading in just a few seconds:
whisper_print_timings:     load time =  1141.81 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    35.75 ms
whisper_print_timings:   sample time =    11.45 ms /    27 runs (    0.42 ms per run)
whisper_print_timings:   encode time =  3596.32 ms /     1 runs ( 3596.32 ms per run)
whisper_print_timings:   decode time =   825.67 ms /    27 runs (   30.58 ms per run)
whisper_print_timings:    total time =  6655.50 ms

Does anyone know what happened during those 11 hours and why it runs faster now? If the model got "compiled" or whatever, can't I just upload it for other people to use? I don't see any changes to the model files since I downloaded them 🤔
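
The first log gives a partial answer on its own: almost none of the 11 hours is spent in the timed stages. The sketch below does the arithmetic with the numbers printed above (variable names are made up for illustration). That the remainder is the "first run on a device may take a while" step is an inference from the log order and the PR description, not something the timings state directly.

```python
# Timings in ms, copied from the first (overnight) run above.
stages_ms = {"load": 985.39, "mel": 35.84, "sample": 11.56,
             "encode": 3036.61, "decode": 794.28}
total_ms = 40924196.00

accounted_ms = sum(stages_ms.values())
unaccounted_ms = total_ms - accounted_ms

print(f"total:       {total_ms / 3_600_000:.2f} h")
print(f"accounted:   {accounted_ms / 1000:.2f} s")
print(f"unaccounted: {unaccounted_ms / 3_600_000:.2f} h")
```

The timed stages add up to under 5 seconds, so more than 99.9% of the wall-clock time falls outside them.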

@cust0mphase

Can you upload it, please?

@artemgordinskiy

@cust0mphase Upload what? The CoreML model link is in my comment above, and as far as I can see, the files have not changed since I downloaded them.

@ganqqwerty

ganqqwerty commented Sep 3, 2023

I confirm that it works well with the model from Hugging Face (of course, I use large). The performance boost on Ventura 13.5 (22G74) is not that big, maybe 20%, but it's definitely faster. I can't wait for the new OS to come out.

@dhwkdjwndjwjjn

Hi, I have a question. I was able to run the Core ML models perfectly on my MacBook Pro M1 Pro. However, when I look at the CPU/GPU/ANE usage through powermetrics while transcribing through Core ML models, I noticed the ANE usage is 0% throughout the transcription and GPU use is 100%. So how do we actually make Core ML run on ANE?

Also, I can confirm that the macOS Sonoma 14.0 beta is much faster at converting to a Core ML model: I was able to convert the large model in under an hour, while on macOS 13 the conversion of the large model would get stuck overnight and never finish.

Last question: can we run the real-time transcription (./stream) with the Core ML model, and if so, how? I was only able to run ./stream with the normal model.

Thanks, and great work to the author(s) of whisper.cpp!

@dhwkdjwndjwjjn

OK, I just found out how to do it from another discussion.

[Screenshot 2023-09-18 at 12 41 36 PM]

You can set it in the file coreml/whisper-encoder.mm.

And as for running Core ML with ./stream, you just need to run:

make clean
WHISPER_COREML=1 make stream -j

Then you can just run ./stream normally and the Core ML model will be loaded.

@dreampuf

FYI, a comparison of CPU+GPU vs. CPU+ANE:

# CPU + GPU
whisper_print_timings:     load time =   185.77 ms
whisper_print_timings:     fallbacks =   1 p /   0 h
whisper_print_timings:      mel time =   729.95 ms
whisper_print_timings:   sample time =  3544.57 ms /  8631 runs (    0.41 ms per run)
whisper_print_timings:   encode time =  8853.00 ms /    49 runs (  180.67 ms per run)
whisper_print_timings:   decode time = 50679.41 ms /  8576 runs (    5.91 ms per run)
whisper_print_timings:   prompt time =  1938.64 ms /    52 runs (   37.28 ms per run)
whisper_print_timings:    total time = 66302.43 ms

## second run
whisper_print_timings:     load time =   306.99 ms
whisper_print_timings:     fallbacks =   1 p /   0 h
whisper_print_timings:      mel time =   666.95 ms
whisper_print_timings:   sample time =  3934.44 ms /  8631 runs (    0.46 ms per run)
whisper_print_timings:   encode time =  7717.25 ms /    49 runs (  157.49 ms per run)
whisper_print_timings:   decode time = 51892.14 ms /  8576 runs (    6.05 ms per run)
whisper_print_timings:   prompt time =  1951.12 ms /    52 runs (   37.52 ms per run)
whisper_print_timings:    total time = 67378.17 ms
# CPU + ANE
whisper_print_timings:     load time =   426.37 ms
whisper_print_timings:     fallbacks =   2 p /   0 h
whisper_print_timings:      mel time =   655.52 ms
whisper_print_timings:   sample time =  4105.80 ms /  9129 runs (    0.45 ms per run)
whisper_print_timings:   encode time = 10249.34 ms /    48 runs (  213.53 ms per run)
whisper_print_timings:   decode time = 55378.71 ms /  9073 runs (    6.10 ms per run)
whisper_print_timings:   prompt time =  1981.35 ms /    52 runs (   38.10 ms per run)
whisper_print_timings:    total time = 73484.55 ms
# CPU + ALL
whisper_print_timings:     load time =   328.41 ms
whisper_print_timings:     fallbacks =   2 p /   0 h
whisper_print_timings:      mel time =   699.48 ms
whisper_print_timings:   sample time =  4050.11 ms /  9129 runs (    0.44 ms per run)
whisper_print_timings:   encode time = 10222.64 ms /    48 runs (  212.97 ms per run)
whisper_print_timings:   decode time = 54836.89 ms /  9073 runs (    6.04 ms per run)
whisper_print_timings:   prompt time =  1984.60 ms /    52 runs (   38.17 ms per run)
whisper_print_timings:    total time = 72802.16 ms
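
To compare runs like these without reading the logs by eye, the `whisper_print_timings` lines can be parsed into numbers. The sketch below is one way to do it; the regular expression is written against the line format shown above and may need adjusting for other whisper.cpp versions, and the function name is hypothetical.

```python
import re

# Matches e.g. "whisper_print_timings:   encode time =  8853.00 ms /    49 runs (...)"
TIMING_RE = re.compile(
    r"whisper_print_timings:\s+(\w+) time =\s+([\d.]+) ms(?:\s+/\s+(\d+) runs)?"
)


def parse_timings(log: str) -> dict:
    """Map stage name -> (total_ms, runs or None) for each timing line in `log`."""
    timings = {}
    for stage, total, runs in TIMING_RE.findall(log):
        timings[stage] = (float(total), int(runs) if runs else None)
    return timings


# Encode lines copied from the first CPU + GPU run and the CPU + ANE run above.
gpu = parse_timings(
    "whisper_print_timings:   encode time =  8853.00 ms /    49 runs (  180.67 ms per run)"
)
ane = parse_timings(
    "whisper_print_timings:   encode time = 10249.34 ms /    48 runs (  213.53 ms per run)"
)
for name, timings in (("CPU + GPU", gpu), ("CPU + ANE", ane)):
    total, runs = timings["encode"]
    print(f"{name}: {total / runs:.2f} ms per encode run")
```

Note that the runs above differ in fallback counts and number of decode runs, so per-run figures are a fairer basis for comparison than the totals.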

@astrowonk

astrowonk commented Oct 1, 2023

I don't have precise before/after numbers, but CoreML Whisper sure seems a lot faster on Sonoma. Not just the "first run on a device may take a while …" step which is almost instant now, but the actual encoding seems better?

Maybe this is an improvement in the latest versions of whisper.cpp itself, but it now runs at close to 100% GPU usage, and I don't remember whether that was always the case. It is ~5x faster than real time with the medium.en model on my lowly regular M1.

@dreampuf

dreampuf commented Oct 9, 2023

Here is an update after Sonoma.

# CPU + GPU
whisper_print_timings:     load time =   298.31 ms
whisper_print_timings:     fallbacks =   1 p /   0 h
whisper_print_timings:      mel time =   687.01 ms
whisper_print_timings:   sample time =  3626.06 ms /  8863 runs (    0.41 ms per run)
whisper_print_timings:   encode time =  9034.63 ms /    48 runs (  188.22 ms per run)
whisper_print_timings:   decode time = 52123.91 ms /  8810 runs (    5.92 ms per run)
whisper_print_timings:   prompt time =  1883.27 ms /    51 runs (   36.93 ms per run)
whisper_print_timings:    total time = 69305.77 ms
ggml_metal_free: deallocating

# 2nd round
whisper_print_timings:     load time =   220.71 ms
whisper_print_timings:     fallbacks =   1 p /   0 h
whisper_print_timings:      mel time =   659.20 ms
whisper_print_timings:   sample time =  3607.61 ms /  8863 runs (    0.41 ms per run)
whisper_print_timings:   encode time =  7268.91 ms /    48 runs (  151.44 ms per run)
whisper_print_timings:   decode time = 52101.25 ms /  8810 runs (    5.91 ms per run)
whisper_print_timings:   prompt time =  1880.41 ms /    51 runs (   36.87 ms per run)
whisper_print_timings:    total time = 66078.09 ms
# CPU + ANE
whisper_print_timings:     load time =   290.60 ms
whisper_print_timings:     fallbacks =   2 p /   0 h
whisper_print_timings:      mel time =   674.62 ms
whisper_print_timings:   sample time =  3722.67 ms /  9019 runs (    0.41 ms per run)
whisper_print_timings:   encode time = 10463.12 ms /    48 runs (  217.98 ms per run)
whisper_print_timings:   decode time = 52677.20 ms /  8963 runs (    5.88 ms per run)
whisper_print_timings:   prompt time =  1935.95 ms /    52 runs (   37.23 ms per run)
whisper_print_timings:    total time = 105001.48 ms

# 2nd round
whisper_print_timings:     load time =   218.93 ms
whisper_print_timings:     fallbacks =   2 p /   0 h
whisper_print_timings:      mel time =   647.12 ms
whisper_print_timings:   sample time =  3874.24 ms /  9019 runs (    0.43 ms per run)
whisper_print_timings:   encode time = 10568.01 ms /    48 runs (  220.17 ms per run)
whisper_print_timings:   decode time = 53258.39 ms /  8963 runs (    5.94 ms per run)
whisper_print_timings:   prompt time =  1956.66 ms /    52 runs (   37.63 ms per run)
whisper_print_timings:    total time = 70788.73 ms
# CPU + ANE + GPU
whisper_print_timings:     load time =   203.14 ms
whisper_print_timings:     fallbacks =   2 p /   0 h
whisper_print_timings:      mel time =   679.72 ms
whisper_print_timings:   sample time =  3868.27 ms /  9019 runs (    0.43 ms per run)
whisper_print_timings:   encode time = 10651.40 ms /    48 runs (  221.90 ms per run)
whisper_print_timings:   decode time = 53248.52 ms /  8963 runs (    5.94 ms per run)
whisper_print_timings:   prompt time =  1942.67 ms /    52 runs (   37.36 ms per run)
whisper_print_timings:    total time = 105808.82 ms

# 2nd round
whisper_print_timings:     load time =   223.98 ms
whisper_print_timings:     fallbacks =   2 p /   0 h
whisper_print_timings:      mel time =   650.97 ms
whisper_print_timings:   sample time =  3727.37 ms /  9019 runs (    0.41 ms per run)
whisper_print_timings:   encode time = 10526.05 ms /    48 runs (  219.29 ms per run)
whisper_print_timings:   decode time = 53171.40 ms /  8963 runs (    5.93 ms per run)
whisper_print_timings:   prompt time =  1950.87 ms /    52 runs (   37.52 ms per run)
whisper_print_timings:    total time = 70573.20 ms
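
To compare the timing blocks above, a small parsing sketch can help. It reads one `whisper_print_timings` block and reports how much of the total time is not covered by the listed stages. The helper names (`parse_timings`, `unaccounted_ms`) are made up for illustration and are not part of whisper.cpp; the regex assumes the log format shown in this thread.

```python
import re

# Matches lines such as:
#   whisper_print_timings:   encode time = 10463.12 ms /    48 runs (  217.98 ms per run)
# Lines without "time" (for example the "fallbacks" line) are skipped.
TIMING_RE = re.compile(
    r"whisper_print_timings:\s*(?P<name>\w+) time\s*=\s*(?P<ms>[\d.]+) ms"
    r"(?:\s*/\s*(?P<runs>\d+) runs)?"
)


def parse_timings(log_text):
    """Return {stage: (total_ms, runs_or_None)} for one timing block."""
    timings = {}
    for match in TIMING_RE.finditer(log_text):
        runs = match.group("runs")
        timings[match.group("name")] = (
            float(match.group("ms")),
            int(runs) if runs else None,
        )
    return timings


def unaccounted_ms(timings):
    """Total time minus the sum of all individually reported stages."""
    total_ms, _ = timings["total"]
    stages = sum(ms for name, (ms, _) in timings.items() if name != "total")
    return total_ms - stages
```

Applied to the CPU + ANE logs above, the first round leaves about 35.2 s unaccounted for and the second round about 0.27 s. That gap is consistent with a one-time Core ML model preparation on first load, though the logs themselves do not confirm the cause.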

jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request Oct 24, 2023
* coreml : use Core ML encoder inference

* coreml : simlpify whisper_encode + log messages

* whisper : resolve rebase conflicts

* coreml : add scripts for CoreML model generation

* bench-all : recognize COREML flag
landtanin pushed a commit to landtanin/whisper.cpp that referenced this pull request Dec 16, 2023
* coreml : use Core ML encoder inference

* coreml : simlpify whisper_encode + log messages

* whisper : resolve rebase conflicts

* coreml : add scripts for CoreML model generation

* bench-all : recognize COREML flag
@helins

helins commented Feb 15, 2024

I was happy with the regular setup on my M1 (Sonoma) so I gave the CoreML setup a try, expecting it to be even better. However I am very surprised to see that it completely degraded performance, at least for the models I am using (medium.en and large-v3). For instance stream became unusable, both slow and inaccurate.

I'll revert to the regular setup but I am very curious as to why using ANE degraded performance so much, it is counterintuitive. I don't spot any errors, the CoreML models seem to load indeed and I can see ANE kick in using powermetrics. Disclaimer, in case it makes a difference, I used the prebuilt models on HF.

@astrowonk

astrowonk commented Feb 15, 2024

> I was happy with the regular setup on my M1 (Sonoma) so I gave the CoreML setup a try, expecting it to be even better. However I am very surprised to see that it completely degraded performance, at least for the models I am using (medium.en and large-v3). For instance stream became unusable, both slow and inaccurate.
>
> I'll revert to the regular setup but I am very curious as to why using ANE degraded performance so much, it is counterintuitive. I don't spot any errors, the CoreML models seem to load indeed and I can see ANE kick in using powermetrics. Disclaimer, in case it makes a difference, I used the prebuilt models on HF.

Did you try the same model twice? There is still a considerable delay for me the first time the CoreML models run, but it is a little faster than the standard build for me after that. However I see very little ANE usage when I compile for CoreML, it's almost all GPU for me.

@helins

helins commented Feb 15, 2024

Several times, yes. The first run took easily 15 minutes to prepare, as warned. But what I described applies to subsequent runs, when everything was ready, it really underperforms to the point of being unusable. Now I am back to the regular setup (metal) and everything is fine once again, I can easily use the large-v3 model with stream for live transcription.

@astrowonk

> Several times, yes. The first run took easily 15 minutes to prepare, as warned. But what I described applies to subsequent runs, when everything was ready, it really underperforms to the point of being unusable. Now I am back to the regular setup (metal) and everything is fine once again, I can easily use the large-v3 model with stream for live transcription.

I'm not 100% sure but after this PR it might be worth trying converting the models to CoreML yourself, depending on when/how the huggingface CoreML models were made.

@RazeBerry

> > Several times, yes. The first run took easily 15 minutes to prepare, as warned. But what I described applies to subsequent runs, when everything was ready, it really underperforms to the point of being unusable. Now I am back to the regular setup (metal) and everything is fine once again, I can easily use the large-v3 model with stream for live transcription.
>
> I'm not 100% sure but after this PR it might be worth trying converting the models to CoreML yourself, depending on when/how the huggingface CoreML models were made.

Just converted it myself; it took about 10 minutes on an M2 Pro with 16 GB RAM.

@shell1986

My model does not start; it just says that it cannot find the file, although the model is compiled and is in the folder.
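
One way to narrow down a "file not found" report like this is to check which path the loader is likely to look for. The sketch below derives that path from the log line elsewhere in this thread (`models/ggml-tiny.bin` is paired with `models/ggml-tiny-encoder.mlmodelc`). The naming rule is inferred from that single log line, and the function names are hypothetical, so treat this as an assumption rather than the loader's documented behaviour.

```python
from pathlib import Path


def expected_coreml_encoder_path(ggml_model_path):
    """Map 'models/ggml-tiny.bin' to 'models/ggml-tiny-encoder.mlmodelc'.

    Assumption: the Core ML encoder sits next to the ggml model and only
    the '.bin' suffix is swapped for '-encoder.mlmodelc'.
    """
    model = Path(ggml_model_path)
    return model.with_name(model.stem + "-encoder.mlmodelc")


def check_coreml_encoder(ggml_model_path):
    """Return (path, found). A compiled .mlmodelc is a directory, not a file."""
    path = expected_coreml_encoder_path(ggml_model_path)
    return path, path.is_dir()


if __name__ == "__main__":
    path, found = check_coreml_encoder("models/ggml-tiny.bin")
    print(f"{path}: {'found' if found else 'missing'}")
```

If the directory is reported missing, comparing the printed path with the actual output location of `generate-coreml-model.sh` may show a naming or folder mismatch.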

@sahmed53

sahmed53 commented May 18, 2024

I have posted this in the main issues section too (I apologise for the double post), but I think people here may be better placed to reply, since this is a CoreML-specific thread. My problem is with using CoreML in iOS apps: I have noticed that the size of the app jumps dramatically every time CoreML is fired up. Downloading the app container in Xcode doesn't show why the "documents and data" figure grows by many MB, and sometimes GB, with repeated usage. If anyone here has used the Objective-C sample or similar, could you check the app size after running it (Settings -> General -> Storage)? Where could the app be saving CoreML files? What could be going on?

This only happens with CoreML, not Metal
I have cleared caches and temp files, but it doesn't affect the documents and data
The Xcode container size does not match the size shown in Settings
I have looked in Instruments but can't find the directory where files are being written; it shows a tmp folder being written to with ANE weights?

Does this issue mean it can't be deployed in production-ready apps?


Please someone help!
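
For anyone trying to reproduce this, a rough sketch for sizing up a downloaded app container on the Mac side is shown below. It lists the largest top-level subdirectories under a given root. The function names and the idea of pointing it at an extracted container are assumptions for illustration; it can only confirm whether the growth is visible inside the container at all, which the report above suggests it may not be.

```python
import os


def dir_size_bytes(root):
    """Sum the sizes of all regular files below root (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            file_path = os.path.join(dirpath, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def largest_subdirs(root, top=10):
    """Return up to `top` (size_bytes, path) pairs, largest first."""
    sizes = []
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((dir_size_bytes(entry.path), entry.path))
    return sorted(sizes, reverse=True)[:top]


if __name__ == "__main__":
    import sys

    # Usage: python sizes.py <path to extracted app container>
    for size, path in largest_subdirs(sys.argv[1]):
        print(f"{size / 1e6:10.1f} MB  {path}")
```

Running it before and after a few transcriptions and comparing the two listings would show which directory, if any, is growing.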

@day-dreaming-guy

day-dreaming-guy commented Jul 28, 2024

Hey @sahmed53 ! Have you solved it?

@bjnortier
Contributor

When you load a CoreML model the first time it does an optimisation and it saves that optimised model somewhere. I could never figure out where – it is something internal and hidden. I suspect that's what you're seeing. Sometimes the OS will delete those files (I assume when storage is low) and then when you load the CoreML model again it will do the optimisation step again. This can take very long on some devices.

This is why I've stopped using CoreML for my app and I only use the Metal version.

@aehlke

aehlke commented Jul 29, 2024

@bjnortier does WhisperKit suffer from the same issue? they became quite popular and rely on CoreML rather than Metal

@bjnortier
Contributor

@aehlke Yes, if you use the WhisperKit macOS TestFlight app you will see "Specializing [...] for your device... This can take several minutes on first load".

iThalay pushed a commit to iThalay/whisper.cpp that referenced this pull request Sep 23, 2024
* coreml : use Core ML encoder inference

* coreml : simlpify whisper_encode + log messages

* whisper : resolve rebase conflicts

* coreml : add scripts for CoreML model generation

* bench-all : recognize COREML flag
@androslaforc

> Hi, which version of Python should I use to install these dependencies? I tried 3.11 and 3.10, but failed to install all dependencies.
>
> pip install ane_transformers
> pip install openai-whisper
> pip install coremltools

Have you got an answer to this?

@androslaforc

Hey, for people who are still struggling with trying this out, here is a bit of a script that shows what my environment is like. I've had a bit of trouble getting this running, so here is my attempt to make a minimally reproducible set of commands. I haven't checked speed improvements yet. This is on a MacBook Air M1, running Ventura 13.3.1.

# I also had problems with xcode, this helped me when it didn't work after reinstalling xcode, thanks @neurostar 
sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer

# setting up coreml conda environment, 3.9 works for me
conda create --name core_whisper_3_9 python=3.9 -y 
conda activate core_whisper_3_9 
# torchvision and tensorflow aren't necessary, but I've had a bit of a problem getting coreml itself to work, so this is to run their example program
pip install torchvision==0.15.1 tensorflow-macos==2.9 coremltools==6.3 ane-transformers==0.1.1 openai-whisper==20230314

# checking whether coreml itself works correctly, no whisper.cpp involvement, code copied from [coremltools.readme.io](https://coremltools.readme.io/docs/convert-a-torchvision-model-from-pytorch)
python -c 'import coremltools as ct
import torch
import torchvision
# Load PyTorch model (and perform tracing)
torch_model = torchvision.models.mobilenet_v2()
torch_model.eval() 

example_input = torch.rand(1, 3, 256, 256)
traced_model = torch.jit.trace(torch_model, example_input)



# Convert using the same API. Note that we need to provide "inputs" for pytorch conversion.
model_from_torch = ct.convert(traced_model,
                              inputs=[ct.TensorType(name="input", 
                                                    shape=example_input.shape)],
                              debug=True)'


# ensure whisper.cpp repo is in the same state as mine was
cd <mypath>/whisper.cpp
git pull
git checkout v1.3.0
git clean -idx # interactive: remove downloaded models. you can skip this step, but it ensures you have a fresh install of the models

# building whisper.cpp with coreml support
mkdir build
cd build
cmake -DWHISPER_COREML=1 ..
make -j
cd ..

# download model and convert to coreml
bash ./models/download-ggml-model.sh tiny
bash ./models/generate-coreml-model.sh tiny

# try out jfk sample
./build/bin/main -m models/ggml-tiny.bin -f samples/jfk.wav

This led to the following output for the sample for me:

whisper_init_from_file_no_state: loading model from 'models/ggml-tiny.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem required  =  129.00 MB (+    3.00 MB per decoder)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =   73.58 MB
whisper_model_load: model size    =   73.54 MB
whisper_init_state: kv self size  =    2.62 MB
whisper_init_state: kv cross size =    8.79 MB
whisper_init_state: loading Core ML model from 'models/ggml-tiny-encoder.mlmodelc'
whisper_init_state: first run on a device may take a while ...
whisper_init_state: Core ML model loaded

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 1 |

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:10.500]   And so my fellow Americans ask not what your country can do for you ask what you can do for your country.


whisper_print_timings:     load time =    97.74 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    41.69 ms
whisper_print_timings:   sample time =    10.67 ms /    25 runs (    0.43 ms per run)
whisper_print_timings:   encode time =    33.32 ms /     1 runs (   33.32 ms per run)
whisper_print_timings:   decode time =    38.86 ms /    25 runs (    1.55 ms per run)
whisper_print_timings:    total time =  4314.09 ms

Hope this helps someone.

Thanks a lot bro, it works. I was stuck, I love you.
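
For comparing a local environment against the one in the quoted script, a small check like the following may be useful. The pinned versions are copied from the `pip install` line above (reported working with Python 3.9 on that machine); the function names are illustrative only, and other version combinations may well work too.

```python
import sys
from importlib import metadata

# Versions pinned in the quoted script; not an official compatibility list.
PINNED = {
    "coremltools": "6.3",
    "ane-transformers": "0.1.1",
    "openai-whisper": "20230314",
}


def installed_versions(packages):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def report(pinned=PINNED):
    """Build one line per package, plus the running Python version."""
    lines = [f"python {sys.version_info.major}.{sys.version_info.minor}"]
    for name, have in installed_versions(pinned).items():
        status = "ok" if have == pinned[name] else "differs"
        lines.append(f"{name}: installed={have} pinned={pinned[name]} ({status})")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

Run inside the activated conda environment, this prints which of the three packages are missing or differ from the pinned versions.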
