Speech synthesis has come a long way since the 1978 Speak & Spell toy, which once wowed people with its state-of-the-art ability to read words aloud in an electronic voice. Now, using deep learning AI models, software can not only create realistic-sounding voices but also convincingly imitate existing voices from small audio samples.
Along these lines, OpenAI this week announced Speech Engine, a text-to-speech AI model that creates synthetic speech from a 15-second clip of recorded audio. The company provides audio samples of Speech Engine in action on its website.
Once a voice has been cloned, users can enter text into Speech Engine and get AI-generated speech in that voice. But OpenAI isn’t ready to release its technology widely yet. The company originally planned to launch a pilot program earlier this month that would let developers sign up for the Speech Engine API. But after further consideration of the ethical implications, the company decided to scale back its ambitions for now.
“In line with our stance on AI safety and our voluntary commitments, we have chosen to preview but not release this technology broadly at this time,” the company wrote. “We hope this preview of the speech engine both highlights its potential and motivates the need to build societal resilience to the challenges posed by increasingly convincing generative models.”
Voice cloning technology in general is not particularly new. Multiple AI speech synthesis models have emerged since 2022, and the technology is active in the open source community, with software packages such as OpenVoice and XTTSv2. But the idea that OpenAI is inching toward letting anyone use its particular brand of voice technology is noteworthy. In some ways, the company’s reluctance to fully release it may be the bigger story.
OpenAI says the benefits of its speech technology include providing reading assistance through natural-sounding voices, giving creators global reach by translating content while preserving native accents, offering personalized speech options for non-verbal individuals, and helping patients recover their own voice after surgery or a speech disorder.
But it also means that anyone with 15 seconds of someone’s recorded voice could effectively clone it, which has obvious implications for potential abuse. Even though OpenAI has never widely released Speech Engine, the ability to clone voices has already caused trouble in society, including phone scams in which callers imitate the voices of loved ones and campaign robocalls that clone the voices of politicians like Joe Biden.
Additionally, researchers and journalists have shown that voice cloning technology can be used to break into bank accounts that use voice authentication, such as Chase’s Voice ID. That prompted U.S. Sen. Sherrod Brown of Ohio, chairman of the U.S. Senate Committee on Banking, Housing, and Urban Affairs, to send a letter to the CEOs of several major banks in May 2023 asking what security measures the banks had taken to address AI risks.
OpenAI recognized that the technology could cause trouble if widely released, so it initially tried to address those issues through a set of rules. It has been testing the technology with a select group of partner companies since last year. For example, video synthesis company HeyGen has been using the model to translate a speaker’s speech into other languages while preserving the sound of the speaker’s own voice.