Should you ever be in any doubt over the truth of Arthur C. Clarke’s Third Law (the one that states that any sufficiently advanced technology is indistinguishable from magic), just try a HoloLens 2 experience where you can simply tell holograms what to do.
Anyone who manages to walk away from that without feeling a little bit like a Harry Potter wizard (or witch) casting a successful spell is likely leading a fairly jaded and joyless existence. There is something fundamentally satisfying, almost primal, about the empowerment of watching the world change as you command it to.
Yet even the naysayers can probably agree that voice-based technologies driven by artificial intelligence represent a huge market opportunity. Millions of us now own some sort of room-based voice device such as Google Home or Amazon Echo, and the global voice and speech recognition market size is estimated to reach USD 31.82 billion by 2025. We are now witnessing a convergence of various emerging technologies in that space – such as voice recognition, natural language processing, and machine learning – powered by 5G connectivity and the AR cloud.
Immersive experiences like those afforded by the HoloLens offer a tantalizing glimpse of where this is all headed: a screenless world where our natural human interactions plug into the digital realm, creating an entirely new form of hybrid reality. In fact, Gartner predicts that by 2020, 30 percent of web browsing will be done without a screen. The next technology revolution will usher in the era of spatial computing, where multisensory experiences allow us to interact with both the real and digital worlds through natural, intuitive interfaces such as haptics, limb and eye-tracking, and even elements such as taste and scent.
In this screenless world of spatial computing, interfaces will need to become more intuitive, efficient, and empathic. Let’s take a look at three ways in which voice technologies are already enabling this.
1 – Intuitive UX
Spatial audio and AI-driven voice technologies are crucial elements for creating compelling immersive experiences. As Kai Havukainen, Head of Product at Nokia Technologies explained in an interview for Scientific American, “building a dynamic soundscape is essential for virtual experiences to really engender a sense of presence.” Humans, he added, are simply hardwired to pay attention to sound and instinctively use it to map their surroundings, find points of interest and assess potential danger.
There are, however, design considerations that must be taken into account when tackling the challenges of an entirely new medium together with these fast-evolving technologies.
Tim Stutts, Interaction Design Lead at Magic Leap, highlights the sheer complexity of these UX challenges, “A level of complexity is added with voice commands, as the notion of directionality becomes abstract—the cursor for voice is effectively the underlying AI used to determine the intent of a statement, then relate it back to objects, apps and system functions.”
“For voice experiences, you need to have a natural language interface that performs well enough to understand different accents, dialects, and languages,” adds Mark Asher, director of corporate strategy at Adobe, who believes the advancement of voice technologies will serve to “bring the humanity back to computing.”
There are still many hurdles to overcome before we reach that utopian vision, however. As we move towards more pervasive and complex experiences where users have multiple applications open at the same time, they will need to circumvent problems such as unintentionally commanding a hologram while actually talking to the person next to them.
Yet looking at the exponential way AI technologies have developed over the past decade, it isn’t unreasonable to extrapolate that the next few years will usher in real-time contextual applications that accurately identify and act on commands based on assessments of your surroundings (both real and virtual), your personal preferences, and even your biofeedback.
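To make the intent-resolution problem concrete, here is a deliberately minimal sketch of how a voice interface might gate commands behind a wake word (so ambient conversation isn’t misread as a command) and match keywords to intents. Every name, wake word, and rule below is an illustrative assumption, not any platform’s actual API; production systems use statistical natural language understanding rather than keyword sets.

```python
# Toy sketch: wake-word gating plus keyword-based intent resolution.
# All names and rules are illustrative assumptions, not a real product's API.

WAKE_WORD = "hologram"

INTENTS = {
    "move": {"move", "put", "place"},
    "resize": {"resize", "bigger", "smaller", "scale"},
    "dismiss": {"close", "dismiss", "hide"},
}

def resolve_intent(utterance: str):
    """Return (intent, remaining words), or None if the utterance
    wasn't addressed to the system at all."""
    words = utterance.lower().split()
    if WAKE_WORD not in words:
        return None  # probably talking to the person next to you
    for intent, keywords in INTENTS.items():
        if keywords & set(words):
            remainder = [w for w in words if w not in keywords and w != WAKE_WORD]
            return intent, remainder
    return "unknown", words

print(resolve_intent("hologram move the chart to the left"))
print(resolve_intent("could you pass me the salt"))
```

The wake word is the crudest possible answer to the “talking to a hologram vs. talking to a person” problem described above; the contextual systems the article anticipates would instead fuse gaze, scene understanding, and conversation state to infer the addressee.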
2 – Voice Biofeedback
XR technologies already deploy a multitude of sensors that enable collection of biofeedback, yet voice provides a rich vein of data which can be collected in a non-invasive way without the need for cumbersome wearables.
Apart from deliberately using commands to interact with the world around us, our voices provide the scope for AI to contextualize our XR experiences based on subconscious factors such as our mood and physical health. Cymatics – the name given to the process of visualizing soundwaves – gives us some insight into the depth and complexity of the unique patterns projected by our voice.
To produce speech, the brain communicates via the vagus nerve, sending a signal to the larynx, which vibrates the vocal cords to produce sound. Since vocalization is integrated within both our central (CNS) and autonomic (ANS) nervous systems, there is an established correlation between voice output and the impact of stress.
Researchers have been developing methods for voice stress analysis (VSA) and computerized stress detection and body scanning devices for many years. Companies such as Insight Health Apps already leverage this rich data to feed corrective waveforms and patterns back into the body in the form of “Quantum Biofeedback”.
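To give a flavor of what voice-based biofeedback starts from, the sketch below computes two of the simplest acoustic features a voice stress analysis pipeline might use: short-time energy and zero-crossing rate. Real VSA systems rely on much richer measures (pitch jitter, microtremor, spectral features); this is purely illustrative, and the synthetic tone stands in for a frame of recorded speech.

```python
# Minimal sketch of basic acoustic features used as raw material in
# voice analysis. Illustrative only; real VSA uses far richer features.
import math

def rms_energy(frame):
    """Root-mean-square energy of one audio frame (loudness proxy)."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs that change sign --
    a rough correlate of noisiness/tension in the voicing."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    return crossings / (len(frame) - 1)

# A synthetic 100 Hz tone sampled at 8 kHz stands in for a voiced frame.
frame = [math.sin(2 * math.pi * 100 * n / 8000) for n in range(800)]
print(rms_energy(frame), zero_crossing_rate(frame))
```

Tracking how such features drift over a session, relative to a speaker’s own baseline, is the non-invasive aspect that makes voice attractive compared with wearable sensors.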
3 – Bridging the Uncanny Valley
When I was first invited to test social VR platform Sansar, I was shown around some of its virtual worlds by Linden Lab’s CEO Ebbe Altberg. To this day, my lasting impression of that demo was how our interaction felt very natural in spite of us being 5,000 miles and several time zones apart (I was in London and he in San Francisco) and the fact that his avatar looked nothing like his real-world persona.
Not only did Ebbe’s digital self have the face of a woman, but that face was attached to a cartoony blimp-like dinosaur bodysuit. Still, when he spoke, the avatar’s lips, teeth and facial muscles synchronized to the sounds in a way that my brain registered as true. It demonstrated one of the peculiar things about designing virtual experiences: the malleability of “reality,” and the fact we are more willing to suspend our disbelief for some aspects of it than others. Hence an avatar’s appearance doesn’t matter nearly as much as these subconscious prompts that form the core of human interaction.
It was an interesting way of avoiding the pervasive problem known as the “uncanny valley,” which describes that awkward sense of unease you feel when a character or avatar appears human-like yet “not quite there.” Speech Graphics developed the technology that creates this notoriously difficult-to-achieve illusion that an animated face is the source of the sound you hear. Their pipeline merges powerful speech analysis with procedural animation techniques. To achieve this, the algorithm not only replicates the movement of the lips, but also decodes from the speech the energy and emotion of the speaker.
“In the sound of speech there is a wealth of information about what the speaker was doing when he or she made the sound—including the movements of the mouth as it produced the sound, and the energetic state of the speaker, from which we can deduce facial expression. From syllables to scowls,” its website reads. And because it is a universal physical model, it works for any language and any type of character, from realistic humans to cartoonlike avatars.
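Speech Graphics’ actual pipeline is proprietary and far more sophisticated, deriving facial muscle movement and emotional state directly from audio. But the basic idea behind simpler audio-driven lip sync can be sketched as a mapping from recognized phonemes to visemes (mouth shapes) laid out on a timeline. The phoneme labels and viseme names below are illustrative assumptions:

```python
# Toy sketch of phoneme-to-viseme lip sync. Phoneme codes loosely follow
# ARPAbet; viseme labels are invented for illustration.

PHONEME_TO_VISEME = {
    "AA": "open",       # as in "father"
    "IY": "wide",       # as in "see"
    "UW": "rounded",    # as in "boot"
    "M":  "closed",     # lips pressed together
    "B":  "closed",
    "P":  "closed",
    "F":  "lip_teeth",  # lower lip against upper teeth
    "V":  "lip_teeth",
}

def to_viseme_track(phonemes):
    """Convert (phoneme, duration_ms) pairs to (viseme, duration_ms)
    keyframes, merging consecutive identical mouth shapes."""
    track = []
    for phoneme, duration in phonemes:
        viseme = PHONEME_TO_VISEME.get(phoneme, "neutral")
        if track and track[-1][0] == viseme:
            track[-1] = (viseme, track[-1][1] + duration)
        else:
            track.append((viseme, duration))
    return track

# The word "me": M followed by IY.
print(to_viseme_track([("M", 80), ("IY", 120)]))
```

Because a fixed lookup table like this ignores coarticulation, energy, and emotion, it produces exactly the flat, mechanical mouths that trigger the uncanny valley; closing that gap is what makes the analysis-driven approach described above notable.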
As digital experiences move beyond the familiar constraints of screens, our modes of interaction with the digital world are also evolving. Paradoxically, that evolution is taking us back to basic and instinctual forms of natural human interaction, hence the enduring relevance of Clarke’s law. Technology will soon be sufficiently advanced so that it will become an invisible layer of our reality rather than a separate realm requiring special skills to access. And in that hybrid reality we will experience an entirely new type of magic.
Tech Trends offers a broad range of Digital Consultancy services to guide companies, individuals and brands in effectively leveraging existing and emerging technologies in their business strategy.
Alice Bonasio is an XR and Digital Transformation Consultant and Tech Trends’ Editor in Chief. She also regularly writes for Fast Company, Ars Technica, Quartz, Wired and others. Connect with her on LinkedIn and follow @alicebonasio on Twitter.