Does this voice sound familiar to you?
You’ve probably heard it in an advertisement, in an airport announcement, when ordering pizza delivery, or while listening to your favorite audiobook. Artificial voices, or deepfakes, started out with a lousy reputation thanks to their use in scam calls and internet trickery. But since the Macintosh introduced itself with text-to-speech in 1984, the improving quality of AI voice has piqued the interest of a growing number of companies.
Speech synthesis technology in particular has evolved to such a degree of realism that many tech giants now believe AI voice is the future. Recent advances in deep learning and machine learning (ML) have made it possible to replicate many of the subtleties of human speech. Text-to-speech voices that used to sound robotic, monotonous, and lifeless have been transformed into natural, realistic, even celebrity-like voices.
What makes them so fascinating is not only their ability to mimic human-like speech, but also to generate full-length, near-perfect audio narrations. These AI voices can even learn to pause and breathe in all the right places while changing their style or emotion. In a short AI audio clip, you wouldn’t be able to tell the machine’s voice from a human’s; you might only spot the trick if it speaks for too long.
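To get a feel for how accessible this has become, here is a minimal sketch using Google Cloud’s Text-to-Speech Python client with one of its neural WaveNet voices. It’s an illustration, not an endorsement of any one vendor: the voice name is just an example, and running it assumes you have a Google Cloud account with credentials configured. Note how SSML lets you place pauses explicitly, echoing the “breathe in the right places” behavior described above.

```python
# pip install google-cloud-texttospeech
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

# SSML input: the <break> tag inserts a deliberate pause mid-sentence
synthesis_input = texttospeech.SynthesisInput(
    ssml='<speak>Hello there.<break time="400ms"/>I am not a human.</speak>'
)

# "en-US-Wavenet-D" is one example of a neural (WaveNet) voice
voice = texttospeech.VoiceSelectionParams(
    language_code="en-US", name="en-US-Wavenet-D"
)
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3
)

response = client.synthesize_speech(
    input=synthesis_input, voice=voice, audio_config=audio_config
)
with open("narration.mp3", "wb") as out:
    out.write(response.audio_content)  # playable MP3 of the synthesized speech
```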
If you’re interested in AI voice, how it started, how it developed, and how loud it’s getting, follow along with this blog.
History of AI Voice
Recorded voices were here even before the light bulb. One of the earliest working dictation machines was the phonograph, built in 1877. The phonograph used a stylus that etched grooves into a rotating, tinfoil-covered cylinder in response to the pressure of sound vibrations. The process could then be run in reverse: the grooves vibrated the stylus, and those movements were turned back into audible sound.
An upgrade on the phonograph was the Graphophone, commercialized by the Volta Graphophone Company in 1886. The Graphophone used wax-coated cylinders instead of foil, which allowed for longer recordings and higher-quality playback.
Both the Graphophone and the phonograph were the hit devices of the late 19th century, used primarily for dictating letters and scientific documents sent overseas.
However, the first true speech recognition device was Audrey, built in 1952 by Bell Laboratories researchers. Audrey didn’t just record speech, it actually understood it, recognizing the digits 0 through 9 as long as speakers paused between them. Audrey was accurate in theory, but far too huge for practical use: it was housed in a six-foot-tall relay rack and required heavy power generators.
While Audrey could only handle digits, IBM’s Shoebox expanded the vocabulary in 1961. Shoebox was one of the first digital speech recognition tools in the world, able to recognize 16 spoken words, including the digits 0 through 9, and perform simple arithmetic on voice command. It might sound primitive now, but in its day it was a head start that laid the foundation for AI voice assistant technology.
Nearly three decades later, in 1990, Dragon launched a consumer speech recognition product called Dragon Dictate for $9,000. Dragon kept pushing AI voice technology forward with improved products and upgraded versions, and Dragon NaturallySpeaking was released in 1997. The application could recognize continuous speech at around 100 words per minute. This was a huge communication milestone.
Microsoft joined the AI voice club by introducing Clippy in 1996. As one of the earliest virtual assistants, Clippy used Microsoft’s speech recognition engine to accept speech input and provide answers and suggestions. Clippy was eventually discontinued, but its framework is still with us today: it set the standard for virtual assistants to come forward only when called upon rather than speaking constantly.
Evolution of AI Voice
The modern era of voice assistants began in 2011 with Apple’s Siri. Siri was a personal voice assistant that could recognize your voice, though it often needed you to repeat yourself a few times. Siri was soon followed by Google Now in 2012, and then by Microsoft’s Cortana and Amazon’s Echo in 2014. This major development over the last decade was possible thanks to big tech companies’ large investments in AI.
“It’s the future of web searching.”
Google (2011)
On the search side of things, Google spent lots of time and resources building speech recognition features into its applications. Google Chrome users could then use their microphones for voice search within Google Search. AI voice assistance also appeared in standalone smart speaker systems that entered the market, leaving big tech companies battling to capture as much market share as possible.
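Voice search of this kind is plain speech-to-text under the hood. As a rough illustration, here is a minimal sketch using the third-party SpeechRecognition Python library, which wraps Google’s free web speech endpoint; the library, its microphone support (via PyAudio), and the endpoint’s availability are assumptions here, not a description of Chrome’s internals.

```python
# pip install SpeechRecognition pyaudio
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source)  # calibrate for background noise
    print("Say your search query...")
    audio = recognizer.listen(source)

try:
    # Send the captured audio to Google's web speech endpoint for transcription
    query = recognizer.recognize_google(audio)
    print(f"You said: {query}")
except sr.UnknownValueError:
    print("Sorry, I couldn't understand that.")
except sr.RequestError as e:
    print(f"Could not reach the recognition service: {e}")
```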
AI Voice At Present
The present numbers are huge!
Google’s speech recognition accuracy has now risen to 95%. China’s iFlytek has pushed its speech recognition system close to a 98% accuracy rate. This means that talking to your voice assistant is 98% like talking to your friend… except that an AI connected to servers knows more.
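Figures like these usually come from the field’s standard metric, word error rate (WER), with accuracy read as roughly 1 minus WER; that mapping is an assumption here, since neither company publishes its exact evaluation setup. A minimal sketch of the textbook WER computation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # match / substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One word wrong out of ten -> 10% WER, i.e. roughly 90% accuracy
print(word_error_rate("the quick brown fox jumps over the lazy dog today",
                      "the quick brown fox jumps over a lazy dog today"))  # 0.1
```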
On a larger business scale, AI voice has grown popular among brands looking to maintain a consistent sound across millions of customer interactions. AI voice assistance is found today in automated customer service agents, in digital assistants embedded in cars and smart devices, and in millions of modern homes.
AI voices can now power real-time voice changers. For instance, Voicemod Beta can transform your voice into one of eight options, from fantasy characters, pilots, and astronauts to actors like Morgan Freeman. AI voices processed in real time could be ideal for live streaming or fun calls, totally free of charge. Free is the dangerous word here, because free of charge does not necessarily mean free of cost.
The fast development of AI voice assistance, and the increased competition around it, leaves us users with concerns about privacy, data collection, and fraud. But to end this blog on a positive note, imagine how cool the future would be where AI voice morphs into ambient voices embedded in the environments we inhabit, not just trapped inside our devices.