Audio is best for AI
You’re sitting with your friends in a coveted booth at your favorite dive bar. The band starts into their third hour while pool-table sounds intermittently pierce through the excited chatter of a few hundred other patrons, their glasses and bottles clinking. Your in-ear monitor is presenting you with all of these sounds, but their volume has been carefully adjusted, as if by a sound engineer, so that you can focus on the resonant tones of your friends’ voices around the table. You can clearly hear every syllable of the meandering conversation.
As you start to tell your friends about a movie you saw a few years ago, the name of the director escapes you. “I think the director was…” As you briefly pause, a disembodied voice, not someone at the table, whispers “Taika Waititi” in your ear, and you seamlessly work it into the conversation.
On June 5th, Apple unveiled their new $3,500 VR headset, the Apple Vision Pro, which comes out in early 2024. While Mark Zuckerberg’s idea of VR is the “metaverse,” Apple refused to even utter the m-word during their presentation and instead unveiled a different paradigm called “spatial computing.”
Spatial computing means regular computer programs are giant-sized and float in the middle of your room as objects in the world. You don’t need a mouse or touchpad because you direct focus with your gaze and select things with hand gestures. The overall vibe of Apple’s demo videos was clean, modern, and sterile.
While both the metaverse and spatial computing seem cool and sci-fi in a way, to me they also suddenly seem a bit dated. AI has so completely changed my sense of where computing is going that VR now seems like just a side quest; it’s definitely not the next big thing.
I grew up in the era of dial-up internet. When you turned on your computer, it was not connected to the internet. If you wanted to do something online, you dialed up, did your thing, and then disconnected. My use of AI today feels similar. My computer and phone are on all the time, and then once in a while, I pull up ChatGPT or some other AI and ask it a few questions.
I feel confident we are heading towards a future where we have a constant, ongoing dialog with AI. So long as a computing device is on, the primary way you interact with it will be via AI. For every program you run and every website you use, the interaction will almost always be AI-mediated. You might be talking to “your AI,” or “their AI,” or some combination of the two.
AI interaction will dominate because AI allows you, and the computer, to leverage humanity’s greatest invention: human language. Human language is powerful because the syntax is infinitely generative, and the vocabulary is vast and extensible. English speakers tend to use around 20,000 words on a regular basis, but there are arguably over 500,000 total words in English, with new words being coined all the time. There is no limit to sentence length, so the number of thoughts you can express with language is functionally infinite.
Language is compact, expressive, and, once you’ve spent two decades learning it, pretty easy to use. In contrast, traditional graphical user interfaces only allow the user to point and grunt: clicks, taps, and pinches are incredibly limited compared to language.
Since AI is going to give us two-way communication using human language, what is the best device to put you into a constant dialog with AI? Rather than Vision Pro, I’d like to suggest Apple create a device called Hearing Pro, although they might elect just to keep their current branding: AirPods. Audio is the best way to put humans into constant dialog with AI, and constant dialog is where we are heading.
I’m actually a big fan of VR, long term, but the Vision Pro is a heavy device with a heavy external battery pack, and it only runs for two hours on a charge. The weight alone vastly limits how long you can wear the device. Estimates suggest the headset weighs around 500 grams and the battery pack another 200 grams, for a total of just over 1.5 pounds. A pair of AirPods Pro, in contrast, weighs only about 10 grams. Vision Pro is therefore the equivalent of 50 pairs of AirPods Pro strapped to your face and another 20 on your hip.
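For the curious, here’s the back-of-the-envelope arithmetic, using my rough estimates above (Apple hadn’t published official weights at the time):

$$
\underbrace{500\,\text{g}}_{\text{headset}} + \underbrace{200\,\text{g}}_{\text{battery}} = 700\,\text{g} \approx 1.54\,\text{lb}, \qquad
\frac{500\,\text{g}}{10\,\text{g}/\text{pair}} = 50 \text{ pairs (face)}, \quad
\frac{200\,\text{g}}{10\,\text{g}/\text{pair}} = 20 \text{ pairs (hip)}
$$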
And crucially, anything that covers your eyes is socially awkward. The whites of our eyes evolved so we can estimate each other’s gaze direction, and eyes also reveal emotion, interest, engagement, honesty or deception, and more. It’s critical to have unfettered visual access to people’s eyes when talking to them. Vision Pro’s outward-facing display is a cute trick, but it doesn’t replace the real thing, and might in fact be uncanny and therefore creepy.
Human vision is our highest bandwidth sense, by far, and someday I’d love to have a device that could augment my vision and be worn all day, but the Vision Pro is not that device. For the foreseeable future, audio is a much better way to put you into dialog with an AI, and it can always direct you to your phone screen for visuals.
There’s actually a very strong analogy between Vision Pro and AirPods Pro. In both cases, the device blocks out one of your senses but then feeds you a mediated version of that same sense. Vision Pro completely blocks out all light, but then, using cameras, displays the outside world on its screens for you to see. AirPods Pro, meanwhile, block out most external sounds but then use microphones to play those same outside sounds back.
Why do both devices block out a sense just to restore it digitally? Because doing so places that sense under software control. The software can now modify or augment what you see or hear. With Vision Pro, the software can insert your photo album as if it were floating in the middle of your living room. With AirPods Pro, it can block out the rumble of the subway and replace it with a soothing voice summarizing the day’s news, or reading messages from friends.
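To make “a sense under software control” concrete, here’s a toy sketch of the audio version of that loop: microphones in, a software transform, ears out. Everything in it, from the library choice (Python’s sounddevice) to the gain value, is my own illustration and has nothing to do with Apple’s actual firmware. Run it with headphones on, or the passthrough will feed back.

```python
# Toy sketch of mediated hearing: the microphone captures the world,
# and software decides what you actually hear. All names and numbers
# here are illustrative assumptions, not anything Apple has published.
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 48_000
WORLD_GAIN = 0.2  # turn the outside world down to 20% volume

# Stand-in "whisper" from the AI: one second of a soft 440 Hz tone.
# In a real system this would be synthesized speech.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
whisper = 0.1 * np.sin(2 * np.pi * 440 * t).astype(np.float32)
cursor = 0

def callback(indata, outdata, frames, time, status):
    """Runs once per audio buffer: attenuate the world, mix in the AI."""
    global cursor
    world = indata[:, 0] * WORLD_GAIN        # software-controlled passthrough
    chunk = whisper[cursor:cursor + frames]  # next slice of the whisper
    cursor += len(chunk)
    world[:len(chunk)] += chunk              # the voice rides over the world
    outdata[:, 0] = np.clip(world, -1.0, 1.0)

# Wear headphones: an open-air speaker would feed back into the mic.
with sd.Stream(samplerate=SAMPLE_RATE, channels=1, callback=callback):
    sd.sleep(5_000)  # run the mediated-audio loop for five seconds
```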
For decades musicians have used a device called an “in-ear monitor.” An audiologist creates a detailed mold of the musician’s ear canal, and then another company manufactures custom earbuds that fit that person’s ears perfectly. Custom in-ear monitors can easily cost $1,500, but they are the gold standard for music, and I could see them becoming the high-end option for Apple’s AirPods. The in-ear version would have two main advantages: it would block outside sound even better, and it would never fall out. People will notice you have something in your ears, but it won’t be nearly as socially disruptive as covering your face. We look into people’s eyes when we talk to them, not into their ears.
AI will be so valuable in the future because it will know everything about you — your emails, your texts, your calendar, your work Slack, your social media, your documents and notes. You’ll be able to ask it questions like, “Someone wrote me last month who was really angry, and I didn’t write them back, who was it?” or “Do I have any texts that require a response right now?”
You’ll be able to direct your AI to take action using any product or service you have access to. The AI can interact with your calendar, Uber, an airline, a hotel, your kid’s soccer team, your finances, anything. The AI will also listen to your physical surroundings. If a friend asks a question, the AI can whisper information in your ear that helps you answer. And, of course, the AI can do all the normal audio stuff: music, podcasts, and phone calls. The more the AI can do, the higher the value proposition of remaining connected to it at all times.
I also like audio as a modality because it will segue smoothly into brain-machine interfaces, which are on the way. At some point, you will be able to just think about your questions and get the answers whispered in your ear. Longer term, even the reply will be delivered to your brain directly, without audio.
Many times I’ve had the same conversation with my kids. I explain that their life today is actually the distant past; it’s the distant past of a future time when they are very old. And I say that in that future time they might be talking to a grandchild, and that grandchild will be blown away by how primitive everything is. How primitive everything is today, right now.
But until the recent developments in AI, I could never really imagine what change could make today look primitive. Sure, phones and computers would get faster, lighter, and cheaper, but was that it? Just refining what we have? I think a constant and effortless dialog with AI is what’s going to make today’s world seem primitive, and it’s going to happen one ear at a time.