
sync: use encodec's latest version as a submodule (#124)
PABannier authored Feb 13, 2024
1 parent 4b6c18d commit 3c4411d
Showing 67 changed files with 3,194 additions and 3,946 deletions.
31 changes: 27 additions & 4 deletions .github/workflows/build.yml
@@ -4,10 +4,31 @@ on:
push:
branches:
- main
paths: ['.github/workflows/**', '**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu']
- encodec-submodule-fix-ci
paths:
[
".github/workflows/**",
"**/CMakeLists.txt",
"**/Makefile",
"**/*.h",
"**/*.hpp",
"**/*.c",
"**/*.cpp",
"**/*.cu",
]
pull_request:
types: [opened, synchronize, reopened]
paths: ['**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', ".github/workflows/**"]
paths:
[
"**/CMakeLists.txt",
"**/Makefile",
"**/*.h",
"**/*.hpp",
"**/*.c",
"**/*.cpp",
"**/*.cu",
".github/workflows/**",
]

env:
BRANCH_NAME: ${{ github.head_ref || github.ref_name }}
@@ -24,7 +45,7 @@ jobs:
id: checkout
uses: actions/checkout@v4
with:
submodules: true
submodules: recursive

- name: Dependencies
id: depends
@@ -35,6 +56,7 @@
- name: Build
id: cmake_build
run: |
cd bark
mkdir build
cd build
cmake ..
@@ -48,7 +70,7 @@
id: checkout
uses: actions/checkout@v4
with:
submodules: true
submodules: recursive

- name: Dependencies
id: depends
@@ -60,6 +82,7 @@
id: cmake_build
run: |
sysctl -a
cd bark
mkdir build
cd build
cmake ..
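The `paths` filters in the workflow above mean CI runs only when build-relevant files change. As a rough illustration of which changed paths trigger a run (a sketch only: Python's `fnmatch` is a loose approximation of GitHub Actions' glob semantics, e.g. `*` in `fnmatch` can cross `/` separators, which Actions' globs do not):

```python
from fnmatch import fnmatch

# Path filters copied from the workflow diff above. CI triggers when a
# changed file matches at least one pattern. Note: fnmatch lets "*"
# cross "/", so this only approximates GitHub Actions' glob semantics.
PATTERNS = [
    ".github/workflows/**",
    "**/CMakeLists.txt",
    "**/Makefile",
    "**/*.h", "**/*.hpp", "**/*.c", "**/*.cpp", "**/*.cu",
]

def triggers_ci(changed_path: str) -> bool:
    return any(fnmatch(changed_path, pat) for pat in PATTERNS)

print(triggers_ci("bark/bark.cpp"))   # C++ source file -> triggers CI
print(triggers_ci("README.md"))       # docs-only change -> no build
```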
6 changes: 3 additions & 3 deletions .gitmodules
@@ -1,3 +1,3 @@
[submodule "ggml"]
path = ggml
url = https://github.com/ggerganov/ggml.git
[submodule "encodec.cpp"]
path = encodec.cpp
url = https://github.com/PABannier/encodec.cpp
11 changes: 9 additions & 2 deletions .vscode/settings.json
@@ -71,6 +71,13 @@
"algorithm": "cpp",
"bit": "cpp",
"cinttypes": "cpp",
"codecvt": "cpp"
}
"codecvt": "cpp",
"any": "cpp",
"forward_list": "cpp",
"ranges": "cpp",
"set": "cpp",
"span": "cpp",
"valarray": "cpp"
},
"cmake.sourceDirectory": "/Users/pbannier/Documents/bark.cpp/bark"
}
196 changes: 61 additions & 135 deletions README.md
@@ -9,164 +9,92 @@

Inference of [SunoAI's bark model](https://github.com/suno-ai/bark) in pure C/C++.

**Disclaimer: there remain bugs in the inference code. Bark is able to generate audio for some prompts or seeds,
but it does not work for most prompts. The community's current effort is to fix those bugs in order to release
v0.0.2**.

## Description

The main goal of `bark.cpp` is to synthesize audio from a textual input with the [Bark](https://github.com/suno-ai/bark) model efficiently, using only a CPU.
With `bark.cpp`, my goal is to bring **real-time realistic multilingual** text-to-speech generation to the community. Currently, I am focused on porting the [Bark](https://github.com/suno-ai/bark) model to C++.

- [X] Plain C/C++ implementation without dependencies
- [X] AVX, AVX2 and AVX512 for x86 architectures
- [X] Mixed F16 / F32 precision
- [X] 4-bit, 5-bit and 8-bit integer quantization
- [ ] Optimized via ARM NEON, Accelerate and Metal frameworks
- [ ] iOS on-device deployment using CoreML
- [x] Plain C/C++ implementation without dependencies
- [x] AVX, AVX2 and AVX512 for x86 architectures
- [x] CPU and GPU compatible backends
- [x] Mixed F16 / F32 precision
- [x] 4-bit, 5-bit and 8-bit integer quantization
- [x] Metal and CUDA backends

The original implementation of `bark.cpp` targets Bark's 24kHz English model. We expect to support multiple encoders in the future (see [this](https://github.com/PABannier/bark.cpp/issues/36) and [this](https://github.com/PABannier/bark.cpp/issues/6)), as well as music generation models (see [this](https://github.com/PABannier/bark.cpp/issues/62)). This project is for educational purposes.

Demo on [Google Colab](https://colab.research.google.com/drive/1JVtJ6CDwxtKfFmEd8J4FGY2lzdL0d0jT?usp=sharing) ([#95](https://github.com/PABannier/bark.cpp/issues/95))

**Supported platforms:**
---

- [X] Mac OS
- [X] Linux
- [X] Windows
Here is a typical run using `bark.cpp`:

**Supported models:**
```bash
make -j && ./main -p "This is an audio generated by bark.cpp"

- [X] Bark
- [ ] Vocos
- [ ] AudioCraft
__ __
/ /_ ____ ______/ /__ _________ ____
/ __ \/ __ `/ ___/ //_/ / ___/ __ \/ __ \
/ /_/ / /_/ / / / ,< _ / /__/ /_/ / /_/ /
/_.___/\__,_/_/ /_/|_| (_) \___/ .___/ .___/
/_/ /_/

---

Here are typical audio pieces generated by `bark.cpp`:
bark_tokenize_input: prompt: 'this is a dog barking.'
bark_tokenize_input: number of tokens in prompt = 513, first 8 tokens: 20579 20172 10217 27883 28169 25677 10167 129595

https://github.com/PABannier/bark.cpp/assets/12958149/f9f240fd-975f-4d69-9bb3-b295a61daaff
Generating semantic tokens: [========> ] (17%)

https://github.com/PABannier/bark.cpp/assets/12958149/c0caadfd-bed9-4a48-8c17-3215963facc1
bark_print_statistics: mem per token = 0.00 MB
bark_print_statistics: sample time = 9.90 ms / 138 tokens
bark_print_statistics: predict time = 3163.78 ms / 22.92 ms per token
bark_print_statistics: total time = 3188.37 ms

Here is a typical run using Bark:
Generating coarse tokens: [==================================================>] (100%)

```bash
make -j && ./main -p "this is an audio"
I bark.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS: -framework Accelerate
I CC: Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX: Apple clang version 14.0.0 (clang-1400.0.29.202)

bark_model_load: loading model from './ggml_weights'
bark_model_load: reading bark text model
gpt_model_load: n_in_vocab = 129600
gpt_model_load: n_out_vocab = 10048
gpt_model_load: block_size = 1024
gpt_model_load: n_embd = 1024
gpt_model_load: n_head = 16
gpt_model_load: n_layer = 24
gpt_model_load: n_lm_heads = 1
gpt_model_load: n_wtes = 1
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1894.87 MB
gpt_model_load: memory size = 192.00 MB, n_mem = 24576
gpt_model_load: model size = 1701.69 MB
bark_model_load: reading bark vocab

bark_model_load: reading bark coarse model
gpt_model_load: n_in_vocab = 12096
gpt_model_load: n_out_vocab = 12096
gpt_model_load: block_size = 1024
gpt_model_load: n_embd = 1024
gpt_model_load: n_head = 16
gpt_model_load: n_layer = 24
gpt_model_load: n_lm_heads = 1
gpt_model_load: n_wtes = 1
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1443.87 MB
gpt_model_load: memory size = 192.00 MB, n_mem = 24576
gpt_model_load: model size = 1250.69 MB

bark_model_load: reading bark fine model
gpt_model_load: n_in_vocab = 1056
gpt_model_load: n_out_vocab = 1056
gpt_model_load: block_size = 1024
gpt_model_load: n_embd = 1024
gpt_model_load: n_head = 16
gpt_model_load: n_layer = 24
gpt_model_load: n_lm_heads = 7
gpt_model_load: n_wtes = 8
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1411.25 MB
gpt_model_load: memory size = 192.00 MB, n_mem = 24576
gpt_model_load: model size = 1218.26 MB

bark_model_load: reading bark codec model
encodec_model_load: model size = 44.32 MB

bark_model_load: total model size = 74.64 MB

bark_generate_audio: prompt: 'this is an audio'
bark_generate_audio: number of tokens in prompt = 513, first 8 tokens: 20579 20172 20199 33733 129595 129595 129595 129595
bark_forward_text_encoder: ...........................................................................................................

bark_forward_text_encoder: mem per token = 4.80 MB
bark_forward_text_encoder: sample time = 7.91 ms
bark_forward_text_encoder: predict time = 2779.49 ms / 7.62 ms per token
bark_forward_text_encoder: total time = 2829.35 ms

bark_forward_coarse_encoder: .................................................................................................................................................................
..................................................................................................................................................................

bark_forward_coarse_encoder: mem per token = 8.51 MB
bark_forward_coarse_encoder: sample time = 3.08 ms
bark_forward_coarse_encoder: predict time = 10997.70 ms / 33.94 ms per token
bark_forward_coarse_encoder: total time = 11036.88 ms

bark_forward_fine_encoder: .....

bark_forward_fine_encoder: mem per token = 5.11 MB
bark_forward_fine_encoder: sample time = 39.85 ms
bark_forward_fine_encoder: predict time = 19773.94 ms
bark_forward_fine_encoder: total time = 19873.72 ms



bark_forward_encodec: mem per token = 760209 bytes
bark_forward_encodec: predict time = 528.46 ms / 528.46 ms per token
bark_forward_encodec: total time = 663.63 ms
bark_print_statistics: mem per token = 0.00 MB
bark_print_statistics: sample time = 3.96 ms / 410 tokens
bark_print_statistics: predict time = 14303.32 ms / 34.89 ms per token
bark_print_statistics: total time = 14315.52 ms

Number of frames written = 51840.
Generating fine tokens: [==================================================>] (100%)

bark_print_statistics: mem per token = 0.00 MB
bark_print_statistics: sample time = 41.93 ms / 6144 tokens
bark_print_statistics: predict time = 15234.38 ms / 2.48 ms per token
bark_print_statistics: total time = 15282.15 ms

Number of frames written = 51840.

main: load time = 1436.36 ms
main: eval time = 34520.53 ms
main: total time = 35956.92 ms
main: total time = 32786.04 ms
```
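The per-token figures in the log above are just the predict time divided by the token count, which is easy to sanity-check:

```python
# Sanity-check the "ms per token" figures reported by bark_print_statistics
# in the log above: predict_time_ms / n_tokens should match the per-token
# value to two decimal places.
runs = {
    "coarse": (14303.32, 410, 34.89),   # predict ms, tokens, reported ms/token
    "fine":   (15234.38, 6144, 2.48),
}

for name, (predict_ms, n_tokens, reported) in runs.items():
    computed = predict_ms / n_tokens
    print(f"{name}: {computed:.2f} ms/token (reported {reported})")
```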

Here are typical audio pieces generated by `bark.cpp`:

https://github.com/PABannier/bark.cpp/assets/12958149/f9f240fd-975f-4d69-9bb3-b295a61daaff

https://github.com/PABannier/bark.cpp/assets/12958149/c0caadfd-bed9-4a48-8c17-3215963facc1

## Usage

Here are the steps for the bark model.
Here are the steps to use `bark.cpp`.

### Get the code

```bash
git clone --recursive https://github.com/PABannier/bark.cpp.git
cd bark.cpp
git submodule update --init --recursive
```

### Build

To build `bark.cpp`, you must use `CMake`:

```bash
mkdir build
cd build
mkdir bark/build
cd bark/build
cmake ..
cmake --build . --config Release
```
@@ -175,43 +103,43 @@ cmake --build . --config Release

```bash
# install Python dependencies
python3 -m pip install -r requirements.txt
python3 -m pip install -r bark/requirements.txt

# obtain the original bark and encodec weights and place them in ./models
python3 download_weights.py --download-dir ./models
python3 bark/download_weights.py --download-dir ./models

# convert the model to ggml format
python3 convert.py \
python3 bark/convert.py \
--dir-model ./models \
--codec-path ./models \
--vocab-path ./ggml_weights/ \
--out-dir ./ggml_weights/

# run the inference
./main -m ./ggml_weights/ -p "this is an audio"
./bark/build/examples/main/main -m ./ggml_weights/ -p "this is an audio"
```

### (Optional) Quantize weights

Weights can be quantized using the following strategy: `q4_0`, `q4_1`, `q5_0`, `q5_1`, `q8_0`.

Note that to preserve audio quality, we do not quantize the codec model. The bulk of the
computation is in the forward pass of the GPT models.
Note that to preserve audio quality, we do not quantize the codec model. The bulk of the computation is in the forward pass of the GPT models.

```bash
./quantize ./ggml_weights/ggml_weights_text.bin ./ggml_weights_q4/ggml_weights_text.bin q4_0
./quantize ./ggml_weights/ggml_weights_coarse.bin ./ggml_weights_q4/ggml_weights_coarse.bin q4_0
./quantize ./ggml_weights/ggml_weights_fine.bin ./ggml_weights_q4/ggml_weights_fine.bin q4_0
mkdir ggml_weights_q4
cp ggml_weights/*vocab* ggml_weights_q4
./bark/build/examples/quantize/quantize ./ggml_weights/ggml_weights_text.bin ./ggml_weights_q4/ggml_weights_text.bin q4_0
./bark/build/examples/quantize/quantize ./ggml_weights/ggml_weights_coarse.bin ./ggml_weights_q4/ggml_weights_coarse.bin q4_0
./bark/build/examples/quantize/quantize ./ggml_weights/ggml_weights_fine.bin ./ggml_weights_q4/ggml_weights_fine.bin q4_0
```
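For intuition, the 4-bit strategies above quantize weights in small blocks with one scale per block. The following is a simplified sketch of that idea only, not the actual ggml `q4_0` memory layout (which packs two 4-bit values per byte and uses a different scale convention):

```python
# Simplified illustration of 4-bit block quantization (NOT the exact
# ggml q4_0 format): one float scale per block, values rounded to
# signed 4-bit integers in [-8, 7].
def quantize_block(xs):
    scale = max(abs(v) for v in xs) / 7 or 1.0
    q = [max(-8, min(7, round(v / scale))) for v in xs]
    return scale, q

def dequantize_block(scale, q):
    return [scale * v for v in q]

w = [0.03, -0.41, 0.27, 0.88, -1.10, 0.05, 0.64, -0.33]
scale, q = quantize_block(w)
w_hat = dequantize_block(scale, q)
# Reconstruction error is bounded by scale / 2 per value.
err = max(abs(a - b) for a, b in zip(w, w_hat))
print(f"scale={scale:.4f}, max abs error={err:.4f}")
```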

### Seminal papers and background on models
### Seminal papers

- Bark
- [Text Prompted Generative Audio](https://github.com/suno-ai/bark)
- [Text Prompted Generative Audio](https://github.com/suno-ai/bark)
- Encodec
- [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438)
- [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438)
- GPT-3
- [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
- [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)

### Contributing

@@ -225,5 +153,3 @@

- Avoid adding third-party dependencies, extra files, extra headers, etc.
- Always consider cross-compatibility with other operating systems and architectures
- Avoid fancy looking modern STL constructs, keep it simple
- Clean-up any trailing whitespaces, use 4 spaces for indentation, brackets on the same line, `void * ptr`, `int & ref`
30 changes: 0 additions & 30 deletions bark-util.h

This file was deleted.

