Commit

feat: rebase laura's PR
louisjoecodes committed Dec 20, 2024
1 parent 19fdb30 commit 67afe1a
Showing 18 changed files with 374 additions and 129 deletions.
14 changes: 14 additions & 0 deletions fern/docs.yml
@@ -86,6 +86,14 @@ navigation:
path: product/voices/pvc-step-by-step-guide.mdx
- page: Payouts
path: product/voices/voice-library/payouts.mdx
- section: Prompting
contents:
- page: Pronunciation
path: product/prompting/pronounciation.mdx
- page: Pauses
path: product/prompting/pauses.mdx
- page: Pacing and Emotion
path: product/prompting/pacing-and-emotion.mdx
- section: Workflows
contents:
- section: Projects
@@ -134,6 +142,10 @@
contents:
- page: Overview
path: product/troubleshooting/overview.mdx
- page: Regenerations
path: product/troubleshooting/regenerations.mdx
- page: Error Messages
path: product/troubleshooting/error-messages.mdx
- section: Step by step
contents:
- page: Overview
@@ -473,6 +485,8 @@ redirects:
destination: /docs/conversational-ai/libraries/conversational-ai-sdk-python
- source: /docs/speech-synthesis/models
destination: /docs/developer-guides/models
- source: /docs/voices/voice-lab/instant-voice-cloning
destination: /docs/product/voices/voice-lab/instant-voice-cloning
# Overrides
- source: /docs/api-reference/overview
destination: /docs/api-reference/introduction
2 changes: 1 addition & 1 deletion fern/product/guides/getting-started.mdx
@@ -7,7 +7,7 @@ slug: product/guides/getting-started

This guide covers everything from account creation to advanced voice cloning, speech synthesis techniques, dubbing, and expert voiceover.

## [Guides](/docs/product/guides)
## [Guides](/docs/product/guides/getting-started)

<CardGroup cols={2}>
<Card
4 changes: 2 additions & 2 deletions fern/product/guides/speech-synthesis.mdx
@@ -15,8 +15,8 @@ Let’s touch on models and voice settings briefly before generating our audio c
More detailed information about the models is available [here](product/speech-synthesis/models).

- **Multilingual v2 (default)**: Supports 28 languages, known for its accuracy and stability, especially when using high-quality samples.
- **Turbo v2.5**: Generates speech in 32 languages with low latency, ideal for real-time applications.
- **Turbo v2**: Optimized for low-latency English text-to-speech, similar in performance to Turbo v2.5.
- **Flash v2.5**: Generates speech in 32 languages with low latency, ideal for real-time applications.
- **Flash v2**: Optimized for low-latency English text-to-speech, similar in performance to Flash v2.5.
- **English v1**: The oldest and fastest model, best for audiobooks but less accurate.
- **Multilingual v1**: Experimental, surpassed by Multilingual v2, recommended for short text chunks.

45 changes: 12 additions & 33 deletions fern/product/projects/overview.mdx
@@ -104,36 +104,15 @@ Once your Project is converted, you have several download options available.

## Pronunciation Dictionaries

Sometimes you may want to specify the pronunciation of certain words, such as character/brand names, or to specify how acronyms should be read. Pronunciation dictionaries allow this functionality by enabling you to upload a lexicon or dictionary file that specifies pairs of words and how they should be pronounced, either using a phonetic alphabet or word substitutions. Whenever one of these words is encountered in a project, the AI model will pronounce the word using the specified replacement.

To provide a pronunciation dictionary file, open the settings for a project and upload a file in the [.PLS format](https://www.w3.org/TR/pronunciation-lexicon/). When a dictionary is added to a project it will automatically recalculate which pieces of the project will need to be re-converted using the new dictionary file and mark these as unconverted.

Currently we only support PLS files that specify replacements using Phonemes, or Aliases.

- Phonemes. Phonemes are used to specify pronunciation using either the IPA (International Phonetic Alphabet) or CMU Arpabet alphabet. Phoneme rules are currently only supported by the Turbo v2 English model.
- Aliases. Aliases are used to specify pronunciation using other words or phrases. For example, to specify that the "UN" should be read "United Nations" whenever it is encountered in a project. You can use aliases with all models.

Both sets of rules specify a word or phrase they are looking for, referred to as a grapheme in the PLS files, and then their replacement. Please note that searches are case sensitive.

Here is an example PLS file that specifies in IPA the pronunciation of "Apple" with IPA of "ˈæpl̩" and "UN" with an alias of "United Nations":

```
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
alphabet="ipa" xml:lang="en-GB">
<lexeme>
<grapheme>Apple</grapheme>
<phoneme>ˈæpl̩</phoneme>
</lexeme>
<lexeme>
<grapheme>UN</grapheme>
<alias>United Nations</alias>
</lexeme>
</lexicon>
```

When checking for a replacement word in a pronunciation dictionary, the dictionary is checked from start to end and only the very first replacement is used.
Sometimes you may want to specify the pronunciation of certain words, such as character or brand names, or specify how acronyms should be read. Pronunciation dictionaries allow this functionality by enabling you to upload a lexicon or dictionary file that includes rules about how specified words should be pronounced, either using a phonetic alphabet (phoneme tags) or word substitutions (alias tags).

Whenever one of these words is encountered in a project, the AI will pronounce the word using the specified replacement. When checking for a replacement word in a pronunciation dictionary, the dictionary is checked from start to end and only the first replacement is used.

You can add a pronunciation dictionary to your project from the General tab in Project settings.

<Card
title="Pronunciation Dictionaries"
icon="book"
horizontal="true"
href="/docs/product/prompting/pronunciation#pronunciation-dictionaries"
/>
37 changes: 37 additions & 0 deletions fern/product/prompting/pacing-and-emotion.mdx
@@ -0,0 +1,37 @@
---
title: Pacing and Emotion
subtitle: Effective techniques to guide ElevenLabs AI in pacing the speech and conveying emotions.
og:title: Prompting - Pacing and Emotion | ElevenLabs Docs
---

## Pacing

Based on user feedback and test results, it's been theorized that using a single long sample for voice cloning brings more success for some users than using multiple smaller samples. The current theory is that the AI stitches the smaller samples together without any separation, causing pacing issues and faster speech. This is likely why some people have reported fast-talking clones.

To control the pacing of the speaker, you can write in a style similar to that of a book. While it's not a perfect solution, it can help improve the pacing and ensure that the AI generates a voiceover at the right speed. With this technique, you can create high-quality voiceovers that are both customized and easy to listen to.

```
"I wish you were right, I truly do, but you're not," he said slowly.
```



## Emotion

If you want the AI to express a specific emotion, the best approach is to write in a style similar to that of a book. To find good prompts to use, you can flip through some books and identify words and phrases that convey the desired emotion.

For instance, you can use dialogue tags to express emotions, such as `he said, confused`, or `he shouted angrily`. These types of prompts will help the AI understand the desired emotional tone and try to generate a voiceover that accurately reflects it. With this approach, you can create highly customized voiceovers that are perfect for a variety of applications.

```
"Are you sure about that?" he said, confused.
"Don’t test me!" he shouted angrily.
```

You will also have to remove the prompt afterwards, as the AI will read out exactly what you give it, dialogue tags included. The AI can also sometimes infer the intended emotion from the text’s context, even without the use of tags.

```
"That is funny!"
"You think so?"
```

This is not always reliable, since you are depending on the AI to infer whether something is sarcastic, funny, etc. from the context of the text alone.
35 changes: 35 additions & 0 deletions fern/product/prompting/pauses.mdx
@@ -0,0 +1,35 @@
---
title: Pauses
subtitle: How to add pauses to your generated speech.
og:title: Prompting - Pauses | ElevenLabs Docs
---

There are a few ways to introduce a pause or break and influence the rhythm and cadence of the speaker. The most consistent way is programmatically, using the syntax `<break time="1.5s" />`. This will create an exact and natural pause in the speech: it is not just silence added between words, but a pause the AI actually understands from the syntax and renders naturally.

An example could look like this:

```
"Give me one second to think about it." <break time="1.0s" /> "Yes, that would work."
```

Break time should be described in seconds, and the AI can handle pauses of up to 3 seconds in length.

However, since this is more than just inserted silence, how the AI handles these pauses can vary. As usual, the voice used plays a pivotal role in the output. Some voices (for example, voices trained on data with "uh"s and "ah"s in them) have been shown to sometimes insert those vocal mannerisms during the pauses, just as a real speaker might. This is more likely to happen if you add a break tag at the very start or very end of your text.

<Info>Please avoid using an excessive number of break tags, as that has been shown to potentially cause instability in the AI. The speech might start speeding up and becoming very fast, or the AI might introduce more noise in the audio and a few other strange artifacts. We are working on resolving this.</Info>
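
If you're generating speech through the API, break tags are passed as part of the text itself. Here is a minimal sketch against the REST text-to-speech endpoint, assuming the `requests` library; the API key and voice ID are placeholders, and the model ID is only an example.

```python
# Minimal sketch: sending text containing <break /> tags to the
# text-to-speech endpoint. API key and voice ID are placeholders.
import requests

API_KEY = "your-api-key"    # placeholder
VOICE_ID = "your-voice-id"  # placeholder

text = (
    '"Give me one second to think about it." '
    '<break time="1.0s" /> '
    '"Yes, that would work."'
)

response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
    json={"text": text, "model_id": "eleven_multilingual_v2"},
)
response.raise_for_status()

# The response body is the generated audio (MP3 by default).
with open("output.mp3", "wb") as f:
    f.write(response.content)
```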

### Alternatives

<u>These options are inconsistent and might not always work</u>. We recommend using the syntax above for consistency.

One trick that seems to provide the most consistent output, aside from the break tag above, is a simple dash `-` or an em-dash `—`. You can even add multiple dashes, such as `-- --`, for a longer pause.

```
"It - is - getting late."
```

Ellipsis `...` can <u>sometimes</u> also work to add a pause between words but usually also adds some "hesitation" or "nervousness" to the voice that might not always fit.

```
"I... yeah, I guess so..."
```
125 changes: 125 additions & 0 deletions fern/product/prompting/pronounciation.mdx
@@ -0,0 +1,125 @@
---
title: Pronunciation
subtitle: Effective techniques to guide ElevenLabs AI to achieve the correct pronunciation.
og:title: Prompting - Pronunciation | ElevenLabs Docs
---

## Phoneme Tags

<Info>This feature is currently only supported by the "Eleven Flash/Turbo v2" and "Eleven English v1" models</Info>

In certain instances, you may want the model to pronounce a word, name, or phrase in a specific way. Pronunciation can be specified using standardised pronunciation alphabets. Currently we support the International Phonetic Alphabet (IPA) and the CMU Arpabet. Pronunciations are specified by wrapping words using the Speech Synthesis Markup Language (SSML) phoneme tag.

To use this feature as part of your text prompt, you need to wrap the desired word or phrase in the phoneme tag. In each case, replace `"your-IPA-Pronunciation-here"` or `"your-CMU-pronunciation-here"` with your desired IPA or CMU Arpabet pronunciation:

`<phoneme alphabet="ipa" ph="your-IPA-Pronunciation-here">word</phoneme>`

`<phoneme alphabet="cmu-arpabet" ph="your-CMU-pronunciation-here">word</phoneme>`



An example for IPA:

```
<phoneme alphabet="ipa" ph="ˈæktʃuəli">actually</phoneme>
```

An example for CMU Arpabet:

```
<phoneme alphabet="cmu-arpabet" ph="AE K CH UW AH L IY">actually</phoneme>
```

It is important to note that phoneme tags work on a per-word basis. If, for example, you have a first and last name that you want pronounced a certain way, you will have to create a phoneme tag for each word individually.

English is a lexical stress language, which means that within multi-syllable words, some syllables are emphasized more than others. The relative salience of each syllable is crucial for proper pronunciation and meaning distinctions, so it is very important to include the lexical stress when writing both IPA and CMU Arpabet; otherwise, the outcome might not be optimal.

Take the word "talon", for example.

Incorrect:

```
<phoneme alphabet="cmu-arpabet" ph="T AE L AH N">talon</phoneme>
```

Correct:

```
<phoneme alphabet="cmu-arpabet" ph="T AE1 L AH0 N">talon</phoneme>
```

The first example might switch between putting the primary emphasis on AE and AH, while the second example will always be pronounced reliably with the emphasis on AE and no stress on AH.

If you write it as:

```
<phoneme alphabet="cmu-arpabet" ph="T AE0 L AH1 N">talon</phoneme>
```

It will always put emphasis on AH instead of AE.

<Info>With the current implementation, we recommend using CMU Arpabet, as it seems to be a bit more consistent and predictable with the current iteration of AI models. Some people get excellent results with IPA, but CMU Arpabet has proven more reliable for a lot of users. We are working on improving this.</Info>
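
To make this concrete, here is a minimal sketch of sending the stressed phoneme tag above through the REST text-to-speech endpoint, assuming the `requests` library; the API key and voice ID are placeholders, and the model is set to Turbo v2 since phoneme tags are only supported by the Flash/Turbo v2 and English v1 models.

```python
# Minimal sketch: a CMU Arpabet phoneme tag with explicit lexical
# stress, sent to the text-to-speech endpoint. API key and voice ID
# are placeholders; phoneme tags require a model that supports them.
import requests

API_KEY = "your-api-key"    # placeholder
VOICE_ID = "your-voice-id"  # placeholder

text = (
    'The eagle sank its '
    '<phoneme alphabet="cmu-arpabet" ph="T AE1 L AH0 N">talon</phoneme>'
    ' into the branch.'
)

response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
    json={"text": text, "model_id": "eleven_turbo_v2"},
)
response.raise_for_status()

with open("talon.mp3", "wb") as f:
    f.write(response.content)
```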

### Alternatives

Because phoneme tags are only supported by the Flash/Turbo v2 and English v1 models, if you're using the Multilingual v2 or Flash/Turbo v2.5 model, you might need to try alternative methods to get the desired pronunciation for a word. You can find an alternative spelling and write a word more phonetically. You can also employ various tricks such as capital letters, dashes, apostrophes, or even single quotation marks around a single letter or letters.

As an example, a word like "trapezii" could be spelt "trapezIi" to put more emphasis on the "ii" of the word.


## Pronunciation Dictionaries

Some of our tools, such as Projects and Dubbing Studio, allow you to create and upload a pronunciation dictionary. These allow you to specify the pronunciation of certain words, such as character or brand names, or to specify how acronyms should be read. Pronunciation dictionaries allow this functionality by enabling you to upload a lexicon or dictionary file that specifies pairs of words and how they should be pronounced, either using a phonetic alphabet (phoneme tags) or word substitutions (alias tags).

Whenever one of these words is encountered in a project, the AI model will pronounce the word using the specified replacement. When checking for a replacement word in a pronunciation dictionary, the dictionary is checked from start to end and only the first replacement is used.

To provide a pronunciation dictionary file, open the settings for a project and upload a file in either TXT or the [.PLS format](https://www.w3.org/TR/pronunciation-lexicon/). When a dictionary is added to a project it will automatically recalculate which pieces of the project will need to be re-converted using the new dictionary file and mark these as unconverted.

Currently we only support pronunciation dictionaries that specify replacements using phonemes or aliases.

Both phonemes and aliases are sets of rules that specify a word or phrase they are looking for, referred to as a grapheme, and what it will be replaced with. Please note that searches are case sensitive.
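
As a toy illustration (not ElevenLabs code), the start-to-end, first-match, case-sensitive rule behaves roughly like this:

```python
# Toy model of the matching rule described above: rules are checked
# in order from the start of the dictionary, only the first matching
# rule is applied, and grapheme searches are case sensitive.
rules = [
    ("UN", "United Nations"),   # alias rule
    ("Apple", "ˈæpl̩"),          # phoneme rule (IPA)
    ("Apple", "apple-2"),       # never used: an earlier rule matches first
]

def lookup(word: str) -> str | None:
    for grapheme, replacement in rules:
        if grapheme == word:    # case sensitive, first match wins
            return replacement
    return None

print(lookup("UN"))     # United Nations
print(lookup("Apple"))  # ˈæpl̩  (the third rule is never reached)
print(lookup("apple"))  # None  (case sensitive)
```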

<Card
title="Phoneme Tags"
icon="book"
horizontal="true"
href="/docs/product/prompting/pronunciation#phoneme-tags"
/>

### Alias Tags

The alias tag is used to specify pronunciation using other words or phrases. For example, you could use an alias tag to specify that "UN" should be read as "United Nations" whenever it is encountered in a project.

If you're generating using Multilingual v2 or Flash/Turbo v2.5, which don't support phoneme tags, you can use alias tags to specify how you want a word to be pronounced using other words or by spelling the word out more phonetically. Alias tags can be used with all our models, so they can be useful for specifying pronunciation when included in a pronunciation dictionary for Projects, Dubbing Studio or Speech Synthesis via the API.

For example, if your text includes a name that has an unusual pronunciation that the AI might struggle with, you could use an alias tag to specify how you would like it to be pronounced:

```
<lexeme>
<grapheme>Claughton</grapheme>
<alias>Cloffton</alias>
</lexeme>
```

### Pronunciation Dictionary Example

Here is an example pronunciation dictionary that specifies in IPA the pronunciation of "Apple" with IPA of "ˈæpl̩" and "UN" with an alias of "United Nations":

```
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
alphabet="ipa" xml:lang="en-GB">
<lexeme>
<grapheme>Apple</grapheme>
<phoneme>ˈæpl̩</phoneme>
</lexeme>
<lexeme>
<grapheme>UN</grapheme>
<alias>United Nations</alias>
</lexeme>
</lexicon>
```
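
If you generate dictionary files programmatically, the same lexicon can be built with Python's standard library. The sketch below only reproduces the file structure shown above (the optional `xsi:schemaLocation` attributes are omitted for brevity); it does not call any ElevenLabs API.

```python
# Minimal sketch: building the two-rule PLS lexicon above with the
# Python standard library. This only demonstrates the file structure;
# adding it to a project still happens in the project settings.
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
ET.register_namespace("", PLS_NS)

lexicon = ET.Element(f"{{{PLS_NS}}}lexicon", {
    "version": "1.0",
    "alphabet": "ipa",
    "{http://www.w3.org/XML/1998/namespace}lang": "en-GB",
})

# Phoneme rule: pronounce "Apple" with the given IPA string.
apple = ET.SubElement(lexicon, f"{{{PLS_NS}}}lexeme")
ET.SubElement(apple, f"{{{PLS_NS}}}grapheme").text = "Apple"
ET.SubElement(apple, f"{{{PLS_NS}}}phoneme").text = "ˈæpl̩"

# Alias rule: read "UN" as "United Nations".
un = ET.SubElement(lexicon, f"{{{PLS_NS}}}lexeme")
ET.SubElement(un, f"{{{PLS_NS}}}grapheme").text = "UN"
ET.SubElement(un, f"{{{PLS_NS}}}alias").text = "United Nations"

ET.ElementTree(lexicon).write(
    "dictionary.pls", encoding="UTF-8", xml_declaration=True
)
```
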
4 changes: 2 additions & 2 deletions fern/product/speech-synthesis/models.mdx
@@ -152,9 +152,9 @@ Model latency is as low as 75ms (excl. network), making it ideal for real-time i
- Vietnamese
</Accordion>

**Eleven Turbo v2**
**Eleven Flash v2**

A low-latency, English-only model optimized for conversational applications. Turbo v2 is similar in performance to Turbo v2.5 but focused exclusively on English, making it ideal for English-only use cases where speed is critical.
A low-latency, English-only model optimized for conversational applications. Flash v2 is similar in performance to Flash v2.5 but focused exclusively on English, making it ideal for English-only use cases where speed is critical.
- Great quality
- High accuracy with Professional Voice Clones
- Slightly less stable
4 changes: 2 additions & 2 deletions fern/product/speech-synthesis/overview.mdx
@@ -59,10 +59,10 @@ Getting yourself familiar with these different settings and options will be very
</Accordion>

<Accordion title="Models">
As of September 2024, ElevenLabs offers two families of models: standard (high-quality) models and Turbo models, which are optimized for low latency. Each family includes both English-only and multilingual models, tailored for specific use cases with strengths in either speed, accuracy, or language diversity.
As of December 2024, ElevenLabs offers two families of models: standard (high-quality) models and Flash models, which are optimized for low latency. Each family includes both English-only and multilingual models, tailored for specific use cases with strengths in either speed, accuracy, or language diversity.

- **Standard models** (Multilingual v2, Multilingual v1, English v1) are optimized for quality and accuracy, ideal for content creation. These models offer the best quality and stability but have higher latency.
- **Turbo models** (Turbo v2, Turbo v2.5) are designed for low-latency applications like real-time conversational AI. They deliver great performance with faster processing speeds, though with a slight trade-off in accuracy and stability.
- **Flash models** (Flash v2, Flash v2.5) are designed for low-latency applications like real-time conversational AI. They deliver great performance with faster processing speeds, though with a slight trade-off in accuracy and stability.

If you want to find more detailed specifications about which languages each model offers, you can find all that information in our help article [here](https://help.elevenlabs.io/hc/en-us/articles/17883183930129-What-models-do-you-offer-and-what-is-the-difference-between-them-).
