By default OpenAI returns an mp3 while Kokoros seems to return a wav #63

Open
rikhuijzer opened this issue Feb 22, 2025 · 3 comments

@rikhuijzer
Contributor

From the OpenAI docs, OpenAI returns an mp3:

curl https://api.openai.com/v1/audio/speech \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Today is a wonderful day to build something people love!",
    "voice": "alloy"
  }' \
  --output speech.mp3

But when I inspect the file returned by the Kokoros OpenAI-compatible endpoint, it appears to be a wav file:

$ file tests/tmp-openai-compatible.mp3
tests/tmp-openai-compatible.mp3: RIFF (little-endian) data, WAVE audio, IEEE Float, mono 24000 Hz
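The same check can be done without `file` by looking at the first bytes of the response. The following is a minimal sketch, not part of Kokoros; the function names `sniff_audio_format` and `sniff_file` are made up for illustration, and the MP3 test is a heuristic that only looks for an ID3 tag or an MPEG frame sync.

```python
def sniff_audio_format(header: bytes) -> str:
    """Guess the audio container from the first bytes of a file."""
    # RIFF/WAVE: bytes 0-3 are "RIFF" and bytes 8-11 are "WAVE".
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"
    # Many MP3 files start with an ID3v2 tag.
    if header[:3] == b"ID3":
        return "mp3"
    # Otherwise an MP3 starts with an MPEG audio frame: 11 sync bits set.
    # Heuristic only: other MPEG audio streams share this sync pattern.
    if len(header) >= 2 and header[0] == 0xFF and (header[1] & 0xE0) == 0xE0:
        return "mp3"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 12 bytes of `path` and guess its format."""
    with open(path, "rb") as f:
        return sniff_audio_format(f.read(12))
```

Running `sniff_file("tests/tmp-openai-compatible.mp3")` on the file above would be expected to report `wav`, matching the `file` output, regardless of the `.mp3` extension.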

This is the request I made:

$ export HOST="http://localhost:3000"

$ curl "$HOST/v1/audio/speech" \
  --verbose \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Today is a wonderful day to build something people love!",
    "voice": "am_adam"
  }' \
  --output speech.mp3

VLC is unable to play it and ffmpeg also crashes when trying to convert it (while it worked with OpenAI's mp3). The vscode audio-preview extension can play it without problems.

@lucasjinreal
Owner

Hello, the format should be selectable per request. wav is currently the default because it is the format used when saving. Would you like to open a PR that adds mp3 output to the OpenAI-compatible endpoint?
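As a rough illustration of what per-request format selection could look like, here is a language-neutral sketch written in Python. It is not the Kokoros implementation: `resolve_format`, `MIME_TYPES`, and the choice of `wav` as fallback are assumptions based only on this thread.

```python
# Hypothetical mapping from the requested format to a Content-Type header.
MIME_TYPES = {"wav": "audio/wav", "mp3": "audio/mpeg"}


def resolve_format(payload: dict, default: str = "wav") -> tuple[str, str]:
    """Pick the output format from the request body, falling back to `default`."""
    fmt = str(payload.get("response_format") or default).lower()
    if fmt not in MIME_TYPES:
        raise ValueError(f"unsupported response_format: {fmt!r}")
    return fmt, MIME_TYPES[fmt]
```

Note that OpenAI's own endpoint defaults to mp3, so a server aiming for drop-in compatibility might pass `default="mp3"` instead.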

@7jrxt42BxFZo4iAnN4CX
Contributor

7jrxt42BxFZo4iAnN4CX commented Mar 17, 2025

#68

curl -X POST http://localhost:3000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anything can go here",
    "input": "Hello, this is a test of the Kokoro TTS system!",
    "voice": "af_sky",
    "response_format": "mp3"
  }' \
  --output sky-says-hello.mp3
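For clients that are not shell scripts, the same request can be sketched with Python's standard library. The helper names `build_speech_request` and `fetch_speech` are hypothetical; the URL path and JSON fields are the ones used in the curl command above, and the host is assumed to be a locally running server.

```python
import json
import urllib.request


def build_speech_request(host, text, voice, response_format="mp3"):
    """Build a POST request for the /v1/audio/speech endpoint."""
    payload = {
        "model": "tts-1",
        "input": text,
        "voice": voice,
        "response_format": response_format,  # explicitly ask for mp3
    }
    return urllib.request.Request(
        f"{host}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_speech(request, out_path):
    """Send the request and write the raw audio bytes to `out_path`."""
    with urllib.request.urlopen(request) as resp, open(out_path, "wb") as out:
        out.write(resp.read())


# Usage (requires a running server):
# req = build_speech_request("http://localhost:3000", "Hello!", "af_sky")
# fetch_speech(req, "sky-says-hello.mp3")
```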

@gnusupport

ffmpeg -i "${output_file}" "${m4a}" 2>&1

I normally convert the mp3 to an m4a file so that it is ready to paste or share in chat systems, where people can easily listen to the voice message.
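The conversion step can also be scripted. This is a small sketch that wraps the same `ffmpeg -i <input> <output>` invocation shown above; it assumes `ffmpeg` is on the PATH, and the helper names `build_m4a_command` and `convert_to_m4a` are made up for this example.

```python
import subprocess
from pathlib import Path


def build_m4a_command(mp3_path: str) -> list[str]:
    """Return the ffmpeg argument list that converts `mp3_path` to .m4a."""
    m4a_path = str(Path(mp3_path).with_suffix(".m4a"))
    return ["ffmpeg", "-i", mp3_path, m4a_path]


def convert_to_m4a(mp3_path: str) -> str:
    """Run ffmpeg and return the output path; raises if ffmpeg fails."""
    cmd = build_m4a_command(mp3_path)
    subprocess.run(cmd, check=True)
    return cmd[-1]
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.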

4 participants