ChatGPT can now see, hear, and talk to some users

ChatGPT has a voice, or rather, five voices. On Monday, OpenAI announced that its buzzworthy, controversial large language model (LLM) can now converse verbally with users, as well as parse uploaded photos and other images.

In video demonstrations, ChatGPT is shown offering an extemporaneous children’s bedtime story based on the guided prompt, “Tell us a story about a super-duper sunflower hedgehog named Larry.” ChatGPT then describes its hedgehog protagonist and offers details about its home and friends. In another example, a photo of a bicycle is uploaded via ChatGPT’s smartphone app alongside the request “Help me lower my bike seat.” ChatGPT then offers a step-by-step process and tool recommendations, drawing on a combination of user-uploaded photos and text inputs. The company also describes situations such as ChatGPT helping craft dinner recipes based on ingredients identified within photographs of a user’s fridge and pantry, conversing about landmarks seen in pictures, and helping with math homework, although numbers aren’t necessarily its strong suit.

According to OpenAI, the initial five audio voices are based on a new text-to-speech model that can create lifelike audio from only input text and a “few seconds” of sample speech. The current voice options were designed after collaborating with professional voice actors.

Unlike the LLM’s previous under-the-hood developments, OpenAI’s newest advancements are particularly focused on users’ direct experiences with the program as the company seeks to expand ChatGPT’s scope and utility to eventually make it a more complete virtual assistant. The audio and visual add-ons are also extremely helpful in terms of accessibility for disabled users.

“This approach has been informed directly by our work with Be My Eyes, a free mobile app for blind and low-vision people, to understand uses and limitations,” OpenAI explains in its September 25 announcement. “Users have told us they find it valuable to have general conversations about images that happen to contain people in the background, like if someone appears on TV while you’re trying to figure out your remote control settings.”

For years, popular voice AI assistants such as Siri and Alexa have offered particular abilities and services based on programmable databases of specific commands. As The New York Times notes, while updating and altering those databases often proves time-consuming, LLM alternatives can be much speedier, more flexible, and more nuanced. As such, companies like Amazon and Apple are investing in retooling their AI assistants to utilize LLMs of their own.

OpenAI is trying to thread a needle: making its visual identification ability as helpful as possible while also respecting third parties’ privacy and safety. The company first demonstrated its visual ID function earlier this year, but said it would not release any version of it to the public before gaining a more comprehensive understanding of how it could be misused. OpenAI states its developers took “technical measures to significantly limit ChatGPT’s ability to analyze and make direct statements about people” given the program’s well-documented issues involving accuracy and privacy. Additionally, the current model is only “proficient” with tasks in English; its capabilities significantly degrade with other languages, particularly those employing non-Roman scripts.

OpenAI plans on rolling out ChatGPT’s new audio and visual upgrades over the next two weeks, but only for premium subscribers to its Plus and Enterprise plans. That said, the capabilities will become available to more users and developers “soon after.”