Facebook says it’s progressing toward assistants capable of interacting with and understanding the physical world as well as people do. The company announced milestones today implying its future AI will be able to learn how to plan routes, look around its physical environment, listen to what’s happening, and build memories of 3D spaces.
The concept of embodied AI draws on embodied cognition, the theory that many features of psychology — human or otherwise — are shaped by aspects of the entire body of an organism. By applying this logic to AI, researchers hope to improve the performance of AI systems like chatbots, robots, autonomous vehicles, and even smart speakers that interact with their environments, people, and other AI. A truly embodied robot could check to see whether a door is locked, for instance, or retrieve a smartphone that’s ringing in an upstairs bedroom.
“By pursuing these related research agendas and sharing our work with the wider AI community, we hope to accelerate progress in building embodied AI systems and AI assistants that can help people accomplish a wide range of complex tasks in the physical world,” Facebook wrote in a blog post.
While vision is foundational to perception, sound is arguably as important. It captures rich information that is often imperceptible through visual or force data, such as the texture of dried leaves or the pressure inside a champagne bottle. But few systems and algorithms have exploited sound as a vehicle to build physical understanding, which is why Facebook is releasing SoundSpaces as part of its embodied AI efforts.
SoundSpaces is a corpus of audio renderings based on acoustical simulations for 3D environments. Designed to be used with AI Habitat, Facebook’s open source simulation platform, the data set provides a software sensor that makes it possible to insert simulations of sound sources in scanned real-world environments.
SoundSpaces is tangentially related to work from a team at Carnegie Mellon University that released a “sound-action-vision” data set and a family of AI algorithms to investigate the interactions between audio, visuals, and movement. In a preprint paper, they claimed the results show representations from sound can be used to anticipate where objects will move when subjected to physical force.
Facebook says that, unlike the Carnegie Mellon study, creating SoundSpaces required an acoustics modeling algorithm and a bidirectional path-tracing component to model sound reflections in a room. Because materials affect the sounds received in an environment (footsteps on a marble floor sound different from footsteps on carpet), SoundSpaces also attempts to replicate how sound propagates off surfaces like walls. It also allows the rendering of concurrent sound sources placed at multiple locations in environments within popular data sets like Matterport 3D and Replica.
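To make the idea of material-dependent, multi-source rendering concrete, here is a minimal sketch in plain Python. It is not Facebook's algorithm and does not use the SoundSpaces API. It assumes a common simplification in acoustics: what a listener hears is each dry source signal convolved with an impulse response for that source's position, with all sources summed. The function names, the echo model, and the absorption values for "marble" and "carpet" are all hypothetical and chosen only for illustration.

```python
def toy_impulse_response(delay, absorption, n_reflections=4, length=64):
    """Hypothetical toy response: a direct path plus evenly spaced echoes.

    Each successive echo is scaled by (1 - absorption), so a more
    absorbent surface produces echoes that die out faster.
    """
    ir = [0.0] * length
    for k in range(n_reflections + 1):
        idx = delay * (k + 1)
        if idx < length:
            ir[idx] = (1.0 - absorption) ** k
    return ir


def convolve(signal, ir):
    """Direct (full) discrete convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(ir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(ir):
            out[i + j] += s * h
    return out


def render_at_listener(sources):
    """Sum every (signal, impulse_response) pair as heard at one listener."""
    rendered = [convolve(sig, ir) for sig, ir in sources]
    out = [0.0] * max(len(r) for r in rendered)
    for r in rendered:
        for i, v in enumerate(r):
            out[i] += v
    return out


def energy(samples):
    """Total energy of a sampled signal."""
    return sum(v * v for v in samples)


# Illustrative values only: a reflective room versus an absorbent one.
click = [1.0]
marble = toy_impulse_response(delay=3, absorption=0.1)
carpet = toy_impulse_response(delay=5, absorption=0.6)

# Two concurrent sources at different (assumed) positions, heard together.
mixed = render_at_listener([(click, marble), (click, carpet)])
```

In this toy model the "marble" response carries more echo energy than the "carpet" one, which is the qualitative effect the article describes. A real simulator would derive the responses from room geometry and measured material properties rather than from a fixed echo pattern.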
In addition to the data, SoundSpaces introduces a challenge that Facebook calls AudioGoal, where an agent must move through an environment to find a sound-emitting object. It’s an attempt to train AI that sees and hears to localize audible targets in unfamiliar places, and Facebook claims it can enable faster training and higher-accuracy navigation compared with conventional approaches.
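The structure of a task like AudioGoal can be illustrated with a toy example. The sketch below is an assumption-laden stand-in, not Facebook's approach: it uses an empty grid with no walls, a made-up loudness formula that falls off with squared distance, and a hand-written greedy rule instead of a trained agent. The one thing it shares with the task as described is that the agent never sees the target's coordinates; it only hears how loud the sound is at a given cell.

```python
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def loudness(pos, source):
    """Hypothetical loudness model: falls off with squared distance."""
    dx, dy = pos[0] - source[0], pos[1] - source[1]
    return 1.0 / (1.0 + dx * dx + dy * dy)


def navigate_to_sound(start, hear, grid_size, max_steps=100):
    """Greedily step to the loudest neighboring cell until none is louder.

    `hear` maps a grid cell to a loudness value; the agent has no other
    information about where the sound source is.
    """
    pos = start
    path = [pos]
    for _ in range(max_steps):
        neighbors = [
            (pos[0] + dx, pos[1] + dy)
            for dx, dy in MOVES
            if 0 <= pos[0] + dx < grid_size and 0 <= pos[1] + dy < grid_size
        ]
        best = max(neighbors, key=hear)
        if hear(best) <= hear(pos):
            break  # no neighbor is louder, so stop here
        pos = best
        path.append(pos)
    return path


# The source location is hidden inside the `hear` callback.
hidden_source = (7, 2)
path = navigate_to_sound((0, 0), lambda p: loudness(p, hidden_source), grid_size=10)
```

In this open grid the greedy rule always ends at the source. Scanned rooms have walls and reverberation, so loudness alone would not reliably point toward the target there, and this sketch deliberately ignores that.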