This is an automated archive made by the Lemmit Bot.

The original was posted on /r/artificial by /u/Successful-Western27 on 2023-08-09 14:20:43.


If you’re creating voice-enabled products, I hope this will help you choose which model to use!

I read the papers and docs for Bark and Tortoise TTS, two text-to-speech models that look similar on the surface but differ quite a bit in what they can do.

Here’s what Bark can do:

  • It can synthesize natural, human-like speech in multiple languages.
  • Bark can also generate music, sound effects, and other audio.
  • The model can generate laughs, sighs, and other non-verbal sounds. I find these really compelling: the imperfections make the speech sound much more real. Check out an example here (scroll down to “pizza.webm”).
  • Bark allows control over tone, pitch, speaker identity and other attributes through text prompts.
  • The model learns directly from text-audio pairs.
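
To make the text-prompt control concrete, here is a minimal sketch of how a Bark prompt might be put together. The helper names (`build_bark_prompt`, `synthesize_with_bark`) are my own and hypothetical. The bracketed cues, the ♪ marker for singing, and the `v2/en_speaker_6` preset are taken from Bark's README as I remember it, so treat them as assumptions and check the current docs. The synthesis function needs the bark package and scipy installed and downloads model weights on first use.

```python
# Non-verbal cues as listed in Bark's README (assumption: the exact set
# may change between releases, so verify against the current docs).
NONVERBAL_TAGS = {
    "laughs": "[laughs]",
    "sighs": "[sighs]",
    "gasps": "[gasps]",
    "clears throat": "[clears throat]",
}


def build_bark_prompt(text, cue=None, sing=False):
    """Return a Bark text prompt (hypothetical helper).

    cue  -- optional key of NONVERBAL_TAGS, placed before the text
    sing -- wrap the text in music notes to hint at singing
    """
    if sing:
        text = f"♪ {text} ♪"
    if cue is not None:
        text = f"{NONVERBAL_TAGS[cue]} {text}"
    return text


def synthesize_with_bark(prompt, speaker="v2/en_speaker_6", out_path="bark_out.wav"):
    """Generate speech with Bark and write it to a WAV file.

    Imports are deferred so the prompt helper above works without Bark
    installed. The speaker preset name is an assumption from the README.
    """
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()  # downloads and caches the model weights
    audio = generate_audio(prompt, history_prompt=speaker)
    write_wav(out_path, SAMPLE_RATE, audio)
    return out_path
```

Something like `synthesize_with_bark(build_bark_prompt("I like pizza.", cue="laughs"))` should then produce a clip that opens with a laugh, though Bark is generative and the result varies from run to run.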

Whereas for Tortoise TTS:

  • It excels at cloning voices using just short audio samples of a target speaker. This makes it easy to produce speech in many distinct voices (like celebrities). I think voice cloning is the best use case for this tool.
  • The quality of the synthesized voices is pretty high.
  • Tortoise supports fine-grained control of speech characteristics like tone, emotion, and pacing through priming text.
  • Tortoise is trained only on English and cannot produce sound effects.
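
Here is a rough sketch of what voice cloning plus priming text could look like with Tortoise. The helper names (`prime_text`, `clone_voice`) and the voice folder name `my_speaker` are hypothetical. The bracketed priming convention and the `TextToSpeech` / `load_voice` / `tts_with_preset` calls follow the Tortoise README and its example script as I understand them, so check them against the version you install. The 24 kHz output rate is also taken from that example script.

```python
def prime_text(text, hint=None):
    """Prefix text with a bracketed emotional hint (hypothetical helper).

    Per the Tortoise README, bracketed text steers the delivery but is
    meant to be left out of the spoken output.
    """
    if hint is None:
        return text
    return f"[{hint},] {text}"


def clone_voice(text, voice="my_speaker", preset="fast", out_path="tortoise_out.wav"):
    """Synthesize text in a cloned voice and save it as a WAV file.

    Assumes a folder of short reference clips for `voice` exists where
    Tortoise looks for voices. Imports are deferred so prime_text works
    without Tortoise installed.
    """
    import torchaudio
    from tortoise.api import TextToSpeech
    from tortoise.utils.audio import load_voice

    tts = TextToSpeech()
    voice_samples, conditioning_latents = load_voice(voice)
    speech = tts.tts_with_preset(
        text,
        voice_samples=voice_samples,
        conditioning_latents=conditioning_latents,
        preset=preset,
    )
    # Tortoise's example script saves its output at 24 kHz.
    torchaudio.save(out_path, speech.squeeze(0).cpu(), 24000)
    return out_path
```

For example, `clone_voice(prime_text("Please feed me.", "I am really sad"))` would aim for a sad delivery in the cloned voice.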

Here’s how they compare to the other speech-related models I’ve taken a look at so far:

Model                 | Best Use Cases                     | Key Strengths
Bark                  | Voice assistants, audio generation | Flexibility, multilingual
Tortoise TTS          | Audiobooks, voice cloning          | Natural prosody, voice cloning
AudioLDM (full guide) | Voice assistants                   | High-quality speech and SFX
Whisper               | Transcription                      | Accuracy, flexibility
Free VC               | Voice conversion                   | Retains speech style
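
If you want to use this comparison inside a product, one option is to encode it as a small lookup. This is only a sketch of the table above: the `MODELS` dict and the `models_for` helper are hypothetical names, and the entries are copied from the table rather than measured.

```python
# The comparison table as data: model name -> (use cases, key strengths).
MODELS = {
    "Bark": (["voice assistants", "audio generation"], "Flexibility, multilingual"),
    "Tortoise TTS": (["audiobooks", "voice cloning"], "Natural prosody, voice cloning"),
    "AudioLDM": (["voice assistants"], "High-quality speech and SFX"),
    "Whisper": (["transcription"], "Accuracy, flexibility"),
    "Free VC": (["voice conversion"], "Retains speech style"),
}


def models_for(use_case):
    """Return the names of models listed for a use case, sorted alphabetically."""
    wanted = use_case.strip().lower()
    return sorted(name for name, (cases, _) in MODELS.items() if wanted in cases)
```

Calling `models_for("voice assistants")` returns the candidates listed for that use case, and an unknown use case gives an empty list.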

I have a full write-up here if you want to read more. It’s about a 10-minute read. I also looked at the model inputs and outputs and speculated on some products you can build with each tool.