TTS vs Human voice

Written by Premier CX Team | 21-Apr-2022 16:28:43

I’ve been working with real voices for over 20 years, and to my mind (and ears) there’s still a clear difference between a real-life, professional voice artist and what even the best Text to Speech (TTS) engines can produce. So I’ll set my stall out from the start… as good as it’s becoming, I don’t view TTS as a replacement for a professional voice artist. I will admit that the gap has closed, and in terms of sheer practicality TTS really does have advantages in some circumstances:

TTS is a fantastic tool if you need to get an urgent message onto your system, especially if that message is going to be short-lived. Let’s say your online portal goes down… your phone lines light up with worried customers. With a TTS option you can have a message typed up and ready to deploy in minutes, informing callers there’s a problem and you’re on it. Result: a whole lot of callers will abandon that call, better informed and reassured, taking pressure off your staff.

Alternatively, suppose you need a message when your regular voice artist is on leave, or maybe it’s a 5 o’clock in the morning situation. In all these cases, a TTS option can be there for you 24/7. It’s also a consistent voice that isn’t going to leave your company and suddenly become unavailable for future recordings.

But hang on a minute… I said that TTS is NOT a replacement for a real voice, before waxing lyrical about how awesome it is. Let’s unpack that apparent contradiction:

At Premier CX we’ve engaged with TTS technology from multiple platforms and suppliers. The best-of-breed TTS voices are now ‘neural’, which is to say that they’re based on neural networks: in essence, a form of AI that learns from input and experience. When you input a script, the TTS processes it against everything that the system has seen before and applies its own variations in tone, pronunciation, and pacing. They’re getting better… but even these are still applying a ‘best guess’ to the final output. On a simple first pass, the audio output often mispronounces words, garbles unfamiliar ones, misses the desired emphasis or just sounds ‘off’ somewhere in the sentence. It does the job. It does it fast. It sounds… okay-ish. It’s not human, and it’s not polished.

A human being, particularly a native speaker of a language, is still leaps and bounds ahead, because language is something we ‘live’, something we ‘inhabit’ at an instinctual level, learned across our entire life to date. Professional voice artists do this too, but they also apply an extra level of awareness about the type of communication, who the audience is and how to manipulate their own voice. So for a polished result, a human result, the pro voice artist wins.

TTS can be improved, of course. You don’t have to just type in a script and accept that first-pass bit of audio. TTS systems allow an operator to take that script and use various tools and mark-up language to adjust the audio. You can nudge it here, push for emphasis there, change the pacing and so on. I’ve done this, and I’ve managed to produce improved TTS audio that was deemed ‘acceptable’ for use. It still wasn’t as good as a real, pro voice artist. It also took me easily more than 10 times longer than it would have taken to read the same script into a microphone. I should note that I’m a ‘casual’ user of the system; I’m not doing it 100% of the time. A professional user of a TTS system would be faster than I am, but it’s also reasonable to say that most contact centres won’t have a professional-level TTS user available either. A studio-based colleague of mine put it succinctly: by comparison with a human recording, TTS takes more effort for a less-perfect result.
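For anyone who hasn’t seen that kind of mark-up, here’s a rough idea of what the nudging looks like. This is only a sketch, and it assumes the engine accepts SSML (the W3C Speech Synthesis Markup Language), which many do, though support for individual tags varies from voice to voice. The function name, the wording of the message and the default values are all made up for illustration; they aren’t taken from any particular TTS product.

```python
from xml.sax.saxutils import escape


def build_outage_ssml(service: str, pause_ms: int = 400, rate: str = "95%") -> str:
    """Hypothetical helper: wrap a short outage message in SSML mark-up.

    Assumes the target TTS engine accepts standard SSML tags; check
    your supplier's documentation, as not every voice honours them all.
    """
    # Escape characters such as & and < so the mark-up stays well-formed.
    service = escape(service)
    return (
        "<speak>"
        # <prosody> slows the overall pacing slightly.
        f'<prosody rate="{rate}">'
        f"We are aware of a problem with our {service}. "
        # <break> inserts a deliberate pause between the two sentences.
        f'<break time="{pause_ms}ms"/>'
        # <emphasis> pushes for stress on a single word.
        'Our team is <emphasis level="strong">already</emphasis> working on it.'
        "</prosody>"
        "</speak>"
    )


ssml = build_outage_ssml("online portal")
print(ssml)
```

Every pause length, speaking rate and emphasis tag in there is something the operator has to choose, listen back to and adjust by ear, which is where the time goes.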

If you’re one of our clients, you can think of it like this: with TTS you could spend a whole lot of time messing with mark-up and TTS tools to try and get something that sounds sort of like what you want… or you could just message us, get on with other work and wait for the much better, real voice version of the message to arrive in your FTP folder. Meanwhile, we’ve engaged our pro VO on your behalf, knowing that the overwhelming majority of recordings come back right first time. Is it instant? No, but as I’ve noted, neither is high quality TTS. Is it less effort and resource than producing good TTS might require from you or a member of your team? Yes.

So… use TTS. But use it for what it suits so incredibly well – emergency, must-have-it-now messages that can be replaced with a real recording at the earliest opportunity. For the bulk of your messages and prompts, have a human voice – with all the human instinct, experience, brand awareness and customer-relatability that a real voice can give you.

If you would like to read more, check out our 'Music to your ears' article or the many other great reads in The Good CX Guide, or give us a call at Premier CX.
