I'm happy that we're getting more and more lifelike text-to-speech voices using AI, but here's something you might not know. These AI-based text-to-speech voices can be unpredictable. It's not that they say things wrong or mispronounce any more than other speech synths do, but what definitely does happen is that they do not say the same string of text the same way twice. The intonation might change, or sometimes even the speed of certain syllables, from utterance to utterance. I use my screen reader with the speed very, very fast. Often I don't pay conscious attention to exactly what words are spoken, because I've gotten so used to the text-to-speech voices I use that my brain does this subconsciously. They have certain patterns that I can recognize, and this tells me what the synth just said without my having to understand every single syllable or word. This is important for reading short texts like names of buttons, window titles, web addresses, messages, usernames, etc.

I much prefer very algorithmic, synthetic speech for this. Not only is it very predictable in how it pronounces things, but it can also be sped up much further. If you speed up, for example, Google's WaveNet voices, they start slurring words. This is obviously no good at all. It's authentic, sure, but it's annoying to me. I'm happy to use AI speech, for example the Siri voices that come with the new macOS, if I'm reading something longer like a book, story and so on. But for everyday use? No thanks. I think it's important that we don't get too carried away here. If I had the choice, I would choose a non-natural voice, and by quite a big margin. Here's your fun fact of the day!

And let's not even talk about code. A natural voice reading code is just... it just doesn't work. It just feels totally wrong. I need to navigate through code very fast. Not only do AI voices have quite a bit of latency, but if I'm quickly scrolling through a file I'm listening to the actual words just as much as I'm listening for familiar sequences of sounds. AI-based TTS voices don't have those familiar sequences, because every utterance is ever so slightly different.
This also means that cloud anything is absolutely out. If you're making web requests to get your screen reader to speak, then stop right now. I won't use it, you wouldn't use it, nobody would use it. I guess Apple can do this on their new devices because of the M1 platform, but even there you can absolutely feel the delay between pressing the key and the voice reacting to what you've done. The simpler the TTS, the faster the response time, the happier I am.

@talon Same here! And I think most screen reader users might agree.
This is why people are, after all these years, still using Eloquence. Because it just works! It reacts to various punctuation, but doesn't make every single string of text an emotional experience. eSpeak gets close to that, at least with inflection lowered a bit.
I wish there were more people creating synths like that now, things that are both fast and pleasant to listen to. Human-like voices are useful, and I'm happy even sighted people have started to use them for reading articles, but they can't replace everything.

@Mayana @talon Ugh, I've at least made modest peace with Microsoft's OneCore voices, but they randomly, and through no pattern I've discovered, replace "as" with "American Samoa". I'm not at all down with a voice that does this. And no, AFAICT it isn't "AS" vs. "as", it just randomly swaps out one for the other. That's...disturbing, to say the least. Our aural culture is bad enough already.

@nolan That's quite odd! Not something I've noticed here, but admittedly I only ever use David, and only rarely.
@talon

@Mayana @talon OK, I've heard "American Samoa" before, and I can't find a string that triggers it, but this one triggers something interesting for me using Microsoft Mark and US English, can't say what it would do with other combos:

10 The passage.mp3

I mean, that's a filename, so part of me is like "Whatever." But that's...a lot of intelligence to apply to how a string is presented, and I don't like TTS engines doing that. Mispronounce something predictably, don't add a million special cases that totally change the meaning of what is spoken, without any indication that those changes are being made.

@nolan @Mayana oh my god that made me snort. Yeah that's awful. I do not like the OneCore voices. They're just... slow. On Windows I'm definitely guilty... and use Eloquence. On other systems it's either eSpeak or Vocalizer.
You know, I would be happy with eSpeak, but to me it just sounds too metallic and sharp. What I like about Eloquence is that it has a relatively warm voice tone, which makes it easier to listen to all day. eSpeak is much sharper, and its consonants have a weird attack. They stick out. Eloquence, and most other voices, soften them a lot. I prefer that.

@nolan TBF, Eloquence can do that too. There are some odd cases where it sees a date even though there is none. But OK, at least dates can be useful, unlike this! Why would we ever need to know those region codes so damn badly?
@talon

@Mayana @talon Fair, Eloquence certainly does do that. But I don't think it's fair to track that all the way down. For all its flaws, the original DECTalk had some interesting quirks itself. My issue is that, as those quirks get more sophisticated, they also get harder to learn your way around. I hate to assume people are stupid, but blind people are going to have some odd beliefs around American Samoa or Northern Mariana Islands, and gods know what else, if TTS keeps interjecting itself in that pathway.
