Voice Experience Issue in Google Text to Speech (streaming_synthesize) #13405

Open
AKSHILMY opened this issue Jan 7, 2025 · 2 comments
Labels: needs more info (This issue needs more information from the customer to proceed.); type: question (Request for information or clarification. Not an issue.)

AKSHILMY commented Jan 7, 2025

Using a Journey voice with standard (non-streaming) TTS produces an engaging voice that draws the listener in.
But when I use the same Journey voice with streaming text input and streaming audio output, the audio sounds less engaging, as if the voice is just reading the text out.

Why is that?
I can't find any way to control it.

Reference TTS Example

parthea added the "type: question" label on Jan 20, 2025
Contributor

parthea commented Jan 21, 2025

Hi @AKSHILMY,

I'm going to transfer this issue to the python-docs-samples repository, which is the source of truth for the Reference TTS Example code sample.

While running the code sample, I noticed that the response in audio_content when using streaming_synthesize contains headerless data.

audio_content (bytes):
The audio data bytes encoded as specified in
the request. This is headerless LINEAR16 audio
with a sample rate of 24000.
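
To illustrate what "headerless" implies: the bytes are raw 16-bit PCM samples, so a player cannot know the sample rate or channel count unless a container supplies them. Below is a minimal sketch, not part of the referenced sample, that wraps such chunks in a WAV container with Python's standard-library wave module. The helper name pcm_chunks_to_wav is hypothetical, mono audio is an assumption (the quoted docs do not state the channel count), and the silent chunks merely stand in for response.audio_content.

```python
import io
import wave


def pcm_chunks_to_wav(chunks, sample_rate=24000, channels=1, sample_width=2):
    """Wrap headerless LINEAR16 chunks in a WAV container and return the bytes.

    Hypothetical helper for illustration only.
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)      # assumption: mono
        wav_file.setsampwidth(sample_width)  # LINEAR16 = 2 bytes per sample
        wav_file.setframerate(sample_rate)   # 24000 Hz per the quoted docs
        for chunk in chunks:
            wav_file.writeframes(chunk)
    # wave fills in the RIFF and data sizes when the writer is closed.
    return buffer.getvalue()


# Stand-in for streamed response.audio_content values: two chunks of silence.
fake_chunks = [b"\x00\x00" * 1200, b"\x00\x00" * 1200]
wav_bytes = pcm_chunks_to_wav(fake_chunks)
```

With a real stream, the chunks argument could be a generator over response.audio_content, and the returned bytes could be written to a .wav file.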

Can you confirm that the necessary header was added before playing the audio file? If so, please share the specific code used to create the audio header.

To help with debugging, I created a code sample that writes a raw WAV file header (following the spec at https://docs.fileformat.com/audio/wav/).

import google.cloud.texttospeech_v1 as texttospeech_v1
import itertools

client = texttospeech_v1.TextToSpeechClient()

# See https://cloud.google.com/text-to-speech/docs/voices for all voices.
streaming_config = texttospeech_v1.StreamingSynthesizeConfig(voice=texttospeech_v1.VoiceSelectionParams(name="en-US-Journey-F", language_code="en-US"))

# Set the config for your stream. The first request must contain your config, and then each subsequent request must contain text.
config_request = texttospeech_v1.StreamingSynthesizeRequest(streaming_config=streaming_config)

# Request generator. Consider using Gemini or another LLM with output streaming as a generator.
def request_generator():
    yield texttospeech_v1.StreamingSynthesizeRequest(input=texttospeech_v1.StreamingSynthesisInput(text="Movies, oh my gosh, I just just absolutely love them."))
    yield texttospeech_v1.StreamingSynthesizeRequest(input=texttospeech_v1.StreamingSynthesisInput(text="They're like time machines taking you to different worlds and landscapes,"))
    yield texttospeech_v1.StreamingSynthesizeRequest(input=texttospeech_v1.StreamingSynthesisInput(text="and um, "))
    yield texttospeech_v1.StreamingSynthesizeRequest(input=texttospeech_v1.StreamingSynthesisInput(text="and I just can't get enough of it."))

streaming_responses = client.streaming_synthesize(itertools.chain([config_request], request_generator()))

# This is a raw header based on the spec at https://docs.fileformat.com/audio/wav/
header = b'RIFF\x00\x00\x00\x00WAVEfmt \x10\x00\x00\x00\x01\x00\x01\x00\xc0]\x00\x00\x80\xbb\x00\x00\x02\x00\x10\x00data\x00\x00\x00\x00'

total_length = 0

with open("output.wav", "wb") as out:
    out.write(header)
    for response in streaming_responses:
        # Accumulate the number of audio bytes written after the header.
        total_length += len(response.audio_content)
        out.write(response.audio_content)
    # Bytes 40-43: size of the data section (little-endian 32-bit integer).
    out.seek(40)
    out.write(total_length.to_bytes(4, "little"))
    # Bytes 4-7: size of the overall file minus 8 bytes (little-endian 32-bit integer).
    out.seek(4)
    out.write((len(header) + total_length - 8).to_bytes(4, "little"))
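
One way to sanity-check the raw header above is to unpack its fields with the standard-library struct module. This is a sketch rather than part of the referenced sample: the field names are my own labels for the standard 44-byte PCM WAV layout, and the two size fields are still zero because they are only filled in after the audio has been written.

```python
import struct

# Same raw header bytes as in the sample above.
header = b'RIFF\x00\x00\x00\x00WAVEfmt \x10\x00\x00\x00\x01\x00\x01\x00\xc0]\x00\x00\x80\xbb\x00\x00\x02\x00\x10\x00data\x00\x00\x00\x00'

# "<" selects little-endian with no padding; the layout totals 44 bytes.
(
    riff_id, riff_size, wave_id,
    fmt_id, fmt_size, audio_format, channels,
    sample_rate, byte_rate, block_align, bits_per_sample,
    data_id, data_size,
) = struct.unpack("<4sI4s4sIHHIIHH4sI", header)

print(f"format={audio_format} channels={channels} rate={sample_rate} Hz")
print(f"byte_rate={byte_rate} block_align={block_align} bits={bits_per_sample}")
```

The unpacked values should agree with the quoted documentation: PCM (format 1), 16 bits per sample, and a 24000 Hz sample rate, with byte_rate equal to sample_rate * channels * bits_per_sample / 8.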

Contributor

parthea commented Jan 21, 2025

Googlers see b/391302662

2 participants