cdminix 
posted an update Aug 5
I just added 5 more models to my open source TTS model benchmark, ttsds/benchmark.
Let's talk about the results!

Over the last couple of days, I added jbetker/tortoise-tts-v2, metavoiceio/metavoice-1B-v0.1, audo/HierSpeechpp, and the unofficial implementations of amphion/NaturalSpeech2 and amphion/valle by https://huggingface.co/amphion.

Takeaways:
- TorToiSe does very well, falling into second place after StyleTTS 2, which is also ranked first in the human evaluation at TTS-AGI/TTS-Arena.
- MetaVoice-1B's overall score is dragged down by its Intelligibility Score (probably due to utterances being cut short), but it achieves #3 in Speaker Score, which indicates good voice cloning ability.
- HierSpeech++ lands in the middle of the road in terms of overall performance, but excels at the Environment Score, achieving #2. This means the model is especially good at modeling recording conditions such as microphone and background noise.
- The Amphion models achieve relatively low scores, possibly because they were not trained for as long as in the papers. However, they seem to struggle for different reasons: the autoregressive VALLE models have low Intelligibility Scores (possibly due to "babbling" or early stop tokens), while NaturalSpeech2 has low Speaker and Prosody scores.

What's next?
I'm planning to add more open source TTS models like suno/bark, CAMB-AI/MARS5-TTS and fishaudio/fish-speech-1.2. I'll also write an article on these and all the other results soon, since our paper, TTSDS -- Text-to-Speech Distribution Score (2407.12707), mostly focused on establishing the benchmark itself rather than the individual TTS systems.

Excited for the fish-speech and mars5 results

I was very thrilled to see Tortoise in the list, especially to see it got 2nd place! Tortoise was one of my first big projects, and to be honest, it feels like the best candidate for the 'perfect' TTS pipeline, if one were to exist.

Also, please note that Tortoise can have MUCH better results when you actually fine-tune it. I'm pretty sure 11labs are basing their whole TTS pipeline on Tortoise.

It's a shame Tortoise didn't get more love, as it surely deserves it. Imagine the quality we could have if Tortoise were outputting 44kHz WAV instead of ~22-24 kHz, and if its quirks were ironed out (the random noises it would often produce and the repetition, for example).

Here's an example of one of the best fine-tunes I have made for Tortoise: https://huggingface.co/SicariusSicariiStuff/TTS_Lola

Totally agree! Tortoise seems to not get benchmarked/compared to as much as other systems, and I don't know exactly why.

Not just for Tortoise, but for all these systems it would be interesting to see how they compare to each other when fine-tuned. Unfortunately, I don't know of any benchmarks/papers that have tried to evaluate that (yet).