OpenAI. Two instances of voice mode talking to each other, one of which has camera access to describe the room. Three minutes long and the most efficient way to internalise what makes voice mode different from old "press the microphone, wait, listen" interfaces — interruption, tone, music, real-time vision, all in one clip.
Keep this as a short intuition pump only. Do not treat it as evidence that a production voice agent can safely handle real users without disclosure, logging boundaries, fallback and human escalation.
See multimodal voice interaction quickly, especially interruption, tone and camera-aware conversation.
Know that demo behavior may differ from the product, region and account tier available to you.
Continue through the same learning path with the next curated companion videos.