
sync: use encodec's latest version as a submodule (#124)
PABannier authored Feb 13, 2024
1 parent 4b6c18d commit 3c4411d
Showing 67 changed files with 3,194 additions and 3,946 deletions.
31 changes: 27 additions & 4 deletions .github/workflows/build.yml
@@ -4,10 +4,31 @@ on:
push:
branches:
- main
paths: ['.github/workflows/**', '**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu']
- encodec-submodule-fix-ci
paths:
[
".github/workflows/**",
"**/CMakeLists.txt",
"**/Makefile",
"**/*.h",
"**/*.hpp",
"**/*.c",
"**/*.cpp",
"**/*.cu",
]
pull_request:
types: [opened, synchronize, reopened]
paths: ['**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', ".github/workflows/**"]
paths:
[
"**/CMakeLists.txt",
"**/Makefile",
"**/*.h",
"**/*.hpp",
"**/*.c",
"**/*.cpp",
"**/*.cu",
".github/workflows/**",
]

env:
BRANCH_NAME: ${{ github.head_ref || github.ref_name }}
@@ -24,7 +45,7 @@ jobs:
id: checkout
uses: actions/checkout@v4
with:
submodules: true
submodules: recursive

- name: Dependencies
id: depends
@@ -35,6 +56,7 @@
- name: Build
id: cmake_build
run: |
cd bark
mkdir build
cd build
cmake ..
@@ -48,7 +70,7 @@
id: checkout
uses: actions/checkout@v4
with:
submodules: true
submodules: recursive

- name: Dependencies
id: depends
@@ -60,6 +82,7 @@
id: cmake_build
run: |
sysctl -a
cd bark
mkdir build
cd build
cmake ..
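The `paths` filters in the workflow above mean CI runs only when build-relevant files change. As a rough illustration of which changed paths trigger a run (a sketch only: Python's `fnmatch` is a loose approximation of GitHub Actions' glob semantics, e.g. `*` in `fnmatch` can cross `/` separators, which Actions' globs do not):

```python
from fnmatch import fnmatch

# Path filters copied from the workflow diff above. CI triggers when a
# changed file matches at least one pattern. Note: fnmatch lets "*"
# cross "/", so this only approximates GitHub Actions' glob semantics.
PATTERNS = [
    ".github/workflows/**",
    "**/CMakeLists.txt",
    "**/Makefile",
    "**/*.h", "**/*.hpp", "**/*.c", "**/*.cpp", "**/*.cu",
]

def triggers_ci(changed_path: str) -> bool:
    return any(fnmatch(changed_path, pat) for pat in PATTERNS)

print(triggers_ci("bark/bark.cpp"))   # C++ source file -> triggers CI
print(triggers_ci("README.md"))       # docs-only change -> no build
```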
6 changes: 3 additions & 3 deletions .gitmodules
@@ -1,3 +1,3 @@
[submodule "ggml"]
path = ggml
url = https://github.com/ggerganov/ggml.git
[submodule "encodec.cpp"]
path = encodec.cpp
url = https://github.com/PABannier/encodec.cpp
11 changes: 9 additions & 2 deletions .vscode/settings.json
@@ -71,6 +71,13 @@
"algorithm": "cpp",
"bit": "cpp",
"cinttypes": "cpp",
"codecvt": "cpp"
}
"codecvt": "cpp",
"any": "cpp",
"forward_list": "cpp",
"ranges": "cpp",
"set": "cpp",
"span": "cpp",
"valarray": "cpp"
},
"cmake.sourceDirectory": "/Users/pbannier/Documents/bark.cpp/bark"
}
196 changes: 61 additions & 135 deletions README.md
@@ -9,164 +9,92 @@

Inference of [SunoAI's bark model](https://github.com/suno-ai/bark) in pure C/C++.

**Disclaimer: there remain bugs in the inference code. Bark is able to generate audio for some prompts or seeds,
but it does not work for most prompts. The community's current effort is to fix those bugs in order to release
v0.0.2**.

## Description

The main goal of `bark.cpp` is to synthesize audio from a textual input with the [Bark](https://github.com/suno-ai/bark) model efficiently, using only a CPU.
With `bark.cpp`, my goal is to bring **real-time realistic multilingual** text-to-speech generation to the community. Currently, I am focused on porting the [Bark](https://github.com/suno-ai/bark) model to C++.

- [X] Plain C/C++ implementation without dependencies
- [X] AVX, AVX2 and AVX512 for x86 architectures
- [X] Mixed F16 / F32 precision
- [X] 4-bit, 5-bit and 8-bit integer quantization
- [ ] Optimized via ARM NEON, Accelerate and Metal frameworks
- [ ] iOS on-device deployment using CoreML
- [x] Plain C/C++ implementation without dependencies
- [x] AVX, AVX2 and AVX512 for x86 architectures
- [x] CPU and GPU compatible backends
- [x] Mixed F16 / F32 precision
- [x] 4-bit, 5-bit and 8-bit integer quantization
- [x] Metal and CUDA backends

The original implementation of `bark.cpp` targets Bark's 24kHz English model. We expect to support multiple encoders in the future (see [this](https://github.com/PABannier/bark.cpp/issues/36) and [this](https://github.com/PABannier/bark.cpp/issues/6)), as well as music generation models (see [this](https://github.com/PABannier/bark.cpp/issues/62)). This project is for educational purposes.

Demo on [Google Colab](https://colab.research.google.com/drive/1JVtJ6CDwxtKfFmEd8J4FGY2lzdL0d0jT?usp=sharing) ([#95](https://github.com/PABannier/bark.cpp/issues/95))

**Supported platforms:**
---

- [X] Mac OS
- [X] Linux
- [X] Windows
Here is a typical run using `bark.cpp`:

**Supported models:**
```bash
make -j && ./main -p "This is an audio generated by bark.cpp"

- [X] Bark
- [ ] Vocos
- [ ] AudioCraft
__ __
/ /_ ____ ______/ /__ _________ ____
/ __ \/ __ `/ ___/ //_/ / ___/ __ \/ __ \
/ /_/ / /_/ / / / ,< _ / /__/ /_/ / /_/ /
/_.___/\__,_/_/ /_/|_| (_) \___/ .___/ .___/
/_/ /_/

---

Here are typical audio pieces generated by `bark.cpp`:
bark_tokenize_input: prompt: 'this is a dog barking.'
bark_tokenize_input: number of tokens in prompt = 513, first 8 tokens: 20579 20172 10217 27883 28169 25677 10167 129595

https://github.com/PABannier/bark.cpp/assets/12958149/f9f240fd-975f-4d69-9bb3-b295a61daaff
Generating semantic tokens: [========> ] (17%)

https://github.com/PABannier/bark.cpp/assets/12958149/c0caadfd-bed9-4a48-8c17-3215963facc1
bark_print_statistics: mem per token = 0.00 MB
bark_print_statistics: sample time = 9.90 ms / 138 tokens
bark_print_statistics: predict time = 3163.78 ms / 22.92 ms per token
bark_print_statistics: total time = 3188.37 ms

Here is a typical run using Bark:
Generating coarse tokens: [==================================================>] (100%)

```bash
make -j && ./main -p "this is an audio"
I bark.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS: -framework Accelerate
I CC: Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX: Apple clang version 14.0.0 (clang-1400.0.29.202)

bark_model_load: loading model from './ggml_weights'
bark_model_load: reading bark text model
gpt_model_load: n_in_vocab = 129600
gpt_model_load: n_out_vocab = 10048
gpt_model_load: block_size = 1024
gpt_model_load: n_embd = 1024
gpt_model_load: n_head = 16
gpt_model_load: n_layer = 24
gpt_model_load: n_lm_heads = 1
gpt_model_load: n_wtes = 1
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1894.87 MB
gpt_model_load: memory size = 192.00 MB, n_mem = 24576
gpt_model_load: model size = 1701.69 MB
bark_model_load: reading bark vocab

bark_model_load: reading bark coarse model
gpt_model_load: n_in_vocab = 12096
gpt_model_load: n_out_vocab = 12096
gpt_model_load: block_size = 1024
gpt_model_load: n_embd = 1024
gpt_model_load: n_head = 16
gpt_model_load: n_layer = 24
gpt_model_load: n_lm_heads = 1
gpt_model_load: n_wtes = 1
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1443.87 MB
gpt_model_load: memory size = 192.00 MB, n_mem = 24576
gpt_model_load: model size = 1250.69 MB

bark_model_load: reading bark fine model
gpt_model_load: n_in_vocab = 1056
gpt_model_load: n_out_vocab = 1056
gpt_model_load: block_size = 1024
gpt_model_load: n_embd = 1024
gpt_model_load: n_head = 16
gpt_model_load: n_layer = 24
gpt_model_load: n_lm_heads = 7
gpt_model_load: n_wtes = 8
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1411.25 MB
gpt_model_load: memory size = 192.00 MB, n_mem = 24576
gpt_model_load: model size = 1218.26 MB

bark_model_load: reading bark codec model
encodec_model_load: model size = 44.32 MB

bark_model_load: total model size = 74.64 MB

bark_generate_audio: prompt: 'this is an audio'
bark_generate_audio: number of tokens in prompt = 513, first 8 tokens: 20579 20172 20199 33733 129595 129595 129595 129595
bark_forward_text_encoder: ...........................................................................................................

bark_forward_text_encoder: mem per token = 4.80 MB
bark_forward_text_encoder: sample time = 7.91 ms
bark_forward_text_encoder: predict time = 2779.49 ms / 7.62 ms per token
bark_forward_text_encoder: total time = 2829.35 ms

bark_forward_coarse_encoder: .................................................................................................................................................................
..................................................................................................................................................................

bark_forward_coarse_encoder: mem per token = 8.51 MB
bark_forward_coarse_encoder: sample time = 3.08 ms
bark_forward_coarse_encoder: predict time = 10997.70 ms / 33.94 ms per token
bark_forward_coarse_encoder: total time = 11036.88 ms

bark_forward_fine_encoder: .....

bark_forward_fine_encoder: mem per token = 5.11 MB
bark_forward_fine_encoder: sample time = 39.85 ms
bark_forward_fine_encoder: predict time = 19773.94 ms
bark_forward_fine_encoder: total time = 19873.72 ms



bark_forward_encodec: mem per token = 760209 bytes
bark_forward_encodec: predict time = 528.46 ms / 528.46 ms per token
bark_forward_encodec: total time = 663.63 ms
bark_print_statistics: mem per token = 0.00 MB
bark_print_statistics: sample time = 3.96 ms / 410 tokens
bark_print_statistics: predict time = 14303.32 ms / 34.89 ms per token
bark_print_statistics: total time = 14315.52 ms

Number of frames written = 51840.
Generating fine tokens: [==================================================>] (100%)

bark_print_statistics: mem per token = 0.00 MB
bark_print_statistics: sample time = 41.93 ms / 6144 tokens
bark_print_statistics: predict time = 15234.38 ms / 2.48 ms per token
bark_print_statistics: total time = 15282.15 ms

Number of frames written = 51840.

main: load time = 1436.36 ms
main: eval time = 34520.53 ms
main: total time = 35956.92 ms
main: total time = 32786.04 ms
```
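The per-token figures in the log above are just the predict time divided by the token count, which is easy to sanity-check:

```python
# Sanity-check the "ms per token" figures reported by bark_print_statistics
# in the log above: predict_time_ms / n_tokens should match the per-token
# value to two decimal places.
runs = {
    "coarse": (14303.32, 410, 34.89),   # predict ms, tokens, reported ms/token
    "fine":   (15234.38, 6144, 2.48),
}

for name, (predict_ms, n_tokens, reported) in runs.items():
    computed = predict_ms / n_tokens
    print(f"{name}: {computed:.2f} ms/token (reported {reported})")
```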

Here are typical audio pieces generated by `bark.cpp`:

https://github.com/PABannier/bark.cpp/assets/12958149/f9f240fd-975f-4d69-9bb3-b295a61daaff

https://github.com/PABannier/bark.cpp/assets/12958149/c0caadfd-bed9-4a48-8c17-3215963facc1

## Usage

Here are the steps for the bark model.
Here are the steps to use `bark.cpp`.

### Get the code

```bash
git clone --recursive https://github.com/PABannier/bark.cpp.git
cd bark.cpp
git submodule update --init --recursive
```

### Build

To build `bark.cpp`, you must use `CMake`:

```bash
mkdir build
cd build
mkdir bark/build
cd bark/build
cmake ..
cmake --build . --config Release
```
@@ -175,43 +103,43 @@ cmake --build . --config Release

```bash
# install Python dependencies
python3 -m pip install -r requirements.txt
python3 -m pip install -r bark/requirements.txt

# obtain the original bark and encodec weights and place them in ./models
python3 download_weights.py --download-dir ./models
python3 bark/download_weights.py --download-dir ./models

# convert the model to ggml format
python3 convert.py \
python3 bark/convert.py \
--dir-model ./models \
--codec-path ./models \
--vocab-path ./ggml_weights/ \
--out-dir ./ggml_weights/

# run the inference
./main -m ./ggml_weights/ -p "this is an audio"
./bark/build/examples/main/main -m ./ggml_weights/ -p "this is an audio"
```

### (Optional) Quantize weights

Weights can be quantized using the following strategy: `q4_0`, `q4_1`, `q5_0`, `q5_1`, `q8_0`.

Note that to preserve audio quality, we do not quantize the codec model. The bulk of the
computation is in the forward pass of the GPT models.
Note that to preserve audio quality, we do not quantize the codec model. The bulk of the computation is in the forward pass of the GPT models.

```bash
./quantize ./ggml_weights/ggml_weights_text.bin ./ggml_weights_q4/ggml_weights_text.bin q4_0
./quantize ./ggml_weights/ggml_weights_coarse.bin ./ggml_weights_q4/ggml_weights_coarse.bin q4_0
./quantize ./ggml_weights/ggml_weights_fine.bin ./ggml_weights_q4/ggml_weights_fine.bin q4_0
mkdir ggml_weights_q4
cp ggml_weights/*vocab* ggml_weights_q4
./bark/build/examples/quantize/quantize ./ggml_weights/ggml_weights_text.bin ./ggml_weights_q4/ggml_weights_text.bin q4_0
./bark/build/examples/quantize/quantize ./ggml_weights/ggml_weights_coarse.bin ./ggml_weights_q4/ggml_weights_coarse.bin q4_0
./bark/build/examples/quantize/quantize ./ggml_weights/ggml_weights_fine.bin ./ggml_weights_q4/ggml_weights_fine.bin q4_0
```
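For intuition, the 4-bit strategies above quantize weights in small blocks with one scale per block. The following is a simplified sketch of that idea only, not the actual ggml `q4_0` memory layout (which packs two 4-bit values per byte and uses a different scale convention):

```python
# Simplified illustration of 4-bit block quantization (NOT the exact
# ggml q4_0 format): one float scale per block, values rounded to
# signed 4-bit integers in [-8, 7].
def quantize_block(xs):
    scale = max(abs(v) for v in xs) / 7 or 1.0
    q = [max(-8, min(7, round(v / scale))) for v in xs]
    return scale, q

def dequantize_block(scale, q):
    return [scale * v for v in q]

w = [0.03, -0.41, 0.27, 0.88, -1.10, 0.05, 0.64, -0.33]
scale, q = quantize_block(w)
w_hat = dequantize_block(scale, q)
# Reconstruction error is bounded by scale / 2 per value.
err = max(abs(a - b) for a, b in zip(w, w_hat))
print(f"scale={scale:.4f}, max abs error={err:.4f}")
```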

### Seminal papers and background on models
### Seminal papers

- Bark
- [Text Prompted Generative Audio](https://github.com/suno-ai/bark)
- [Text Prompted Generative Audio](https://github.com/suno-ai/bark)
- Encodec
- [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438)
- [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438)
- GPT-3
- [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
- [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)

### Contributing

@@ -225,5 +153,3 @@

- Avoid adding third-party dependencies, extra files, extra headers, etc.
- Always consider cross-compatibility with other operating systems and architectures
- Avoid fancy looking modern STL constructs, keep it simple
- Clean-up any trailing whitespaces, use 4 spaces for indentation, brackets on the same line, `void * ptr`, `int & ref`
30 changes: 0 additions & 30 deletions bark-util.h

This file was deleted.

