Introduction
We aim to advance artificial intelligence responsibly and transparently. This piece explains how Voice Engine, our text-to-speech (TTS) model, generates human-like audio from text, and describes our ongoing safety research into deploying the technology responsibly.
How Voice Engine Works
Voice Engine is a TTS model that generates audio from text and a 15-second sample of a speaker's voice. The model learns the nuances of speech from paired audio and transcriptions, which lets it predict the most probable sounds a speaker would make for a given text.
The model uses a diffusion process: it starts with random noise and progressively denoises it until the output closely matches the articulation of the speaker in the sample audio. This makes it possible to produce speech in a range of voices, accents, and speaking styles.
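To make the "start from noise, refine step by step" idea concrete, here is a toy sketch of the loop. It is not Voice Engine's implementation and not a trained diffusion model: the names toy_denoiser, generate, and rms_error are hypothetical, and the stand-in denoiser cheats by looking at a known target waveform, whereas a real model would predict the noise from the text and the speaker sample.

```python
import math
import random


def toy_denoiser(noisy, target):
    """Stand-in for a trained network: estimates the noise present in `noisy`.

    Assumption for illustration only: we compare against a known target.
    A real model would have to predict this from text and a voice sample.
    """
    return [n - t for n, t in zip(noisy, target)]


def generate(target, steps=50, seed=0):
    """Start from Gaussian noise and strip away estimated noise step by step."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]
    for step in range(steps):
        predicted_noise = toy_denoiser(x, target)
        # Remove a growing share of the remaining noise;
        # the final step (rate == 1.0) removes all of it.
        rate = 1.0 / (steps - step)
        x = [xi - rate * ni for xi, ni in zip(x, predicted_noise)]
    return x


def rms_error(a, b):
    """Root-mean-square difference between two equal-length sequences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a))


# Hypothetical "clean" signal: one cycle of a sine wave, 64 samples long.
target = [math.sin(2 * math.pi * i / 64) for i in range(64)]
sample = generate(target)
print(f"RMS error after denoising: {rms_error(sample, target):.2e}")
```

The loop structure is the part that carries over: the output begins as pure noise, and each pass removes some of what the denoiser judges to be noise. In a real system the quality of the result depends entirely on how well the learned denoiser captures the speaker's voice.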
Development and Early Testing
Developed in late 2022, Voice Engine underwent extensive internal testing with a mix of public and private voice samples. This phase was crucial for our alignment and safety research, helping us understand the technical frontiers and establish necessary safeguards. The outputs of these tests were reserved solely for internal assessments.
Collaboration with Policymakers
As part of our iterative deployment framework, we engaged with global policymakers to demonstrate the capabilities and associated risks of synthetic voice models. This engagement began in the summer of 2023 and has contributed significantly to our safety research and policy development.
Limited Releases and Use Cases
In September 2023, Voice Engine powered ChatGPT’s Voice Mode, using real voices selected through a detailed process involving professional voice actors and industry advisors. In November 2023, we launched a simple TTS API with six preset voices created from 15-second samples by professional voice actors.
Safety Measures and Future Directions
Building Voice Engine safely is a top priority. We collaborate with partners across various sectors to incorporate feedback and ensure ethical usage. Partners must adhere to strict usage policies that prohibit impersonation without consent, require explicit approval from the original speaker, and require disclosure to listeners that the voices are AI-generated. We also implement safety measures such as watermarking and proactive monitoring.
Looking ahead, our latest model, GPT-4o, integrates native audio capabilities, presenting new interaction opportunities and risks. We are actively red-teaming GPT-4o to address potential risks in areas such as social psychology, bias, and misinformation. Our cautious approach includes restricting GPT-4o’s audio outputs to preset voices from professional actors and developing new classifiers to mitigate risks.
Conclusion
We are committed to advancing AI technology responsibly. Our ongoing efforts in developing and deploying Voice Engine and GPT-4o reflect our dedication to safety, transparency, and ethical use. As we continue to innovate, we will keep stakeholders informed and engaged, ensuring that synthetic voice technology benefits society while minimizing potential risks.