I'm happy that we're getting more and more lifelike text-to-speech voices using AI, but here's something you might not know. These AI-based text-to-speech voices can be unpredictable. It's not that they say things wrong or mispronounce any more than other speech synths do, but what definitely does happen is that they don't say the same string of text the same way twice. They might change the intonation, or sometimes even the speed of certain syllables, from utterance to utterance. I use my screen reader with the speed set very, very fast. Often I don't pay conscious attention to exactly what words are spoken, because I've gotten so used to the text-to-speech voices I use that my brain does this subconsciously. They have certain patterns that I can recognize, and this tells me what the synth just said without having to understand every single syllable or word. This is important for reading short texts like names of buttons, window titles, web addresses, messages, usernames, etc.


I much prefer very algorithmic, synthetic speech for this. Not only is it very predictable in how it pronounces things, but it can also be sped up much further. If you speed up, for example, Google's WaveNet voices, they start slurring words. This is obviously no good at all. It's authentic, sure, but it's annoying to me. I'm happy to use AI speech, for example the Siri voices that come with the new macOS, if I'm reading something longer like a book, a story and so on. But for everyday use? No thanks. I think it's important that we don't get too carried away here. If I had the choice, I would choose a non-natural voice, and by quite a big margin. Here's your fun fact of the day!

And let's not even talk about code. A natural voice reading code is just... it just doesn't work. It just feels totally wrong. I need to navigate through code very fast. Not only do AI voices have quite a bit of latency, but if I'm quickly scrolling through a file, I'm listening for familiar sequences of sounds just as much as I'm listening to the actual words. AI-based TTS voices don't give me that, because things are ever so slightly different each time.
This also means that cloud anything is absolutely out. If you're making web requests to get your screen reader to speak, then stop right now. I won't use it, you wouldn't use it, nobody would use it. I guess Apple can do this on their new devices because of the M1 platform, but even there you can absolutely feel the delay between pressing the key and the voice reacting to what you've done. The simpler the TTS, the faster the response time, the happier I am.

