The adoption of virtual assistants (Google Home, Amazon Alexa, Siri, Cortana) has been staggering over the last two years. Alongside the AI used to understand the questions asked of them, there has been a vast improvement in the text-to-speech (TTS) used to deliver the answers. TTS has moved from being ‘intelligible but robotic’ to sounding human… but is that good enough, or is there a higher level of humanness that the major manufacturers are aspiring to?
Premier CX has been a leading provider of voice artist recordings for contact centres for 25 years, and we’ve long been monitoring TTS. Our feelings towards it have moved from laughter at obvious mispronunciations; to keeping a keen eye on it; to partnering with one of the world’s leading university faculties that focuses on TTS; to bringing TTS development in-house and embedding it as part of some of our solutions. It’s good… very good now!
Does TTS signal the end of the contact centre voice artist?
No way! TTS now sounds very human, but most callers can still discern that it is not. That puts TTS a little below the level of the 7.8 billion people in the world who all sound totally human… not surprisingly, as they are! The voices of successful professional voice artists and radio presenters stand out massively from those of non-voice artists. They have an indefinable quality to their voices that draws the listener’s attention no matter how dull the topic they are talking about.
We are approached by 10-20 people each week saying they are voice artists wanting to work with us… but they are not. They are just people with a microphone who think that because they can speak they can be a voice artist… but they just don’t have the ‘Je ne sais quoi’.
We worked with our university partner to identify the essence of what makes a successful voice artist stand out… and failed. They speak 10-30% faster than the rest of us without losing the clarity of what they say, but there is something else. It became apparent that the ‘something else’ varied from one voice artist to the next, in the same way that handsome men might often be tall and dark, but they still look very different in all manner of ways!
Many of our customers get overly hung up on choosing a voice artist, without understanding that the way the voice is ‘produced’ is far more important. As professionals, voice artists are accustomed to being produced. Meryl Streep was equally convincing as Margaret Thatcher in The Iron Lady, Karen Blixen in Out of Africa and Donna in Mamma Mia!
Prompts and messages for phone systems are usually quite short, and it is impossible for a TTS engine to deduce from the text alone where the emphasis needs to be – as the examples below illustrate. Nuances of emphasis and tone can completely change a listener’s impression, which is why we often involve our customers in recording sessions.
Thank you for calling. Thank you for calling. Thank you for calling.
SSML (Speech Synthesis Markup Language) and some TTS tuning tools offer the ability to change the output to a certain extent, but they are quite crude and broadly restricted to pitch, tone and volume. Some engines support a change in style from the ‘information delivery’ tone you get from virtual assistants or satellite navigation systems to a newscaster, chat or empathetic style. To date, we’ve not found a promotional style from any of the main TTS providers that satisfies the style most often needed for in-queue messaging or IVR prompts.
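As a rough illustration of the kind of control SSML offers, the sketch below wraps a single word of a prompt in the standard SSML `<emphasis>` element, giving three differently stressed versions of "Thank you for calling." The helper name `emphasise_word` is hypothetical and written only for this example, and support for `<emphasis>` varies between engines and voices, so treat it as a sketch rather than a recipe for any particular provider.

```python
from xml.sax.saxutils import escape


def emphasise_word(text: str, index: int, level: str = "strong") -> str:
    """Return an SSML string with the word at position `index` emphasised.

    Hypothetical helper for illustration; not part of any TTS provider's SDK.
    """
    words = text.split()
    if not 0 <= index < len(words):
        raise IndexError(f"word index {index} out of range for {text!r}")

    parts = []
    for i, word in enumerate(words):
        safe = escape(word)  # escape &, < and > so the SSML stays well-formed
        if i == index:
            safe = f'<emphasis level="{level}">{safe}</emphasis>'
        parts.append(safe)
    return "<speak>" + " ".join(parts) + "</speak>"


if __name__ == "__main__":
    prompt = "Thank you for calling."
    # Stress "Thank", then "you", then "calling." in turn.
    for word_index in (0, 1, 3):
        print(emphasise_word(prompt, word_index))
```

Each variant would still need to be auditioned on the chosen voice: the markup only requests emphasis, and how (or whether) an engine renders it is up to the engine.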
These examples illustrate the points above for EAW Fixings, a fictitious company:
Raw text to speech using one of the best British text to speech characters.
□ Human sounding, but some mispronunciations and a rather flat read.
Edited text to speech using Speech Synthesis Markup Language (SSML) and studio editing.
□ Mispronunciations corrected and more emotion / expression added.
On-brand real voice, studio produced.
□ Choice of voice aligned with company audio brand guidelines.
Does TTS have a Place in Contact Centre Solutions?
Yes… ish, which is why we are investing in it. The major advantage of TTS is that it is instantaneous, which makes it ideal for prototyping call flows. However, we still think an on-brand recording by a real human voice gives a significantly better caller experience, and it should replace the TTS as soon as it is available. Note that with Premier CX’s Emergency recording SLA of 2 hours, it should be rare that TTS prompts are ever heard by callers.
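The idea of using TTS only until the studio recording arrives can be sketched as a simple lookup: play the recording if one exists for a prompt, otherwise fall back to synthesised speech. All of the names below (`PromptSource`, `choose_prompt_source`, the prompt IDs and file paths) are hypothetical and are not taken from any Premier CX product or telephony platform.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class PromptSource:
    kind: str     # "recording" or "tts"
    payload: str  # audio file path for a recording, text to synthesise for TTS


def choose_prompt_source(
    prompt_id: str,
    script: Mapping[str, str],
    recordings: Mapping[str, str],
) -> PromptSource:
    """Prefer a studio recording; fall back to TTS of the scripted text.

    Raises KeyError if the prompt has neither a recording nor a script entry.
    """
    if prompt_id in recordings:
        return PromptSource("recording", recordings[prompt_id])
    return PromptSource("tts", script[prompt_id])


if __name__ == "__main__":
    # Illustrative data only: two scripted prompts, one already recorded.
    script = {
        "welcome": "Thank you for calling.",
        "closed": "Our offices are currently closed.",
    }
    recordings = {"welcome": "audio/welcome.wav"}

    for prompt_id in script:
        print(prompt_id, choose_prompt_source(prompt_id, script, recordings))
```

With this arrangement, swapping TTS for the real voice needs no change to the call flow: adding the file to `recordings` is enough.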
We are often asked, “Which TTS engine is best: Amazon Polly, Google, IBM or Microsoft?” The answer is that it is different for different languages and dialects. The quality of the TTS is dependent on the TTS engine and the input recordings used to create the TTS character. The best male Irish TTS character may be produced by a different manufacturer from the best female Irish character. Ultimately there is no global ‘best’ even for a particular gender/language, as it depends on what persona you are looking for.
Find out more
This is a fast-moving technology, and we will be blogging about it again soon… so please subscribe if you want to be kept up to date.