Human-level 3D shape perception emerges from multi-view learning

Talk Presentation: Monday, May 18, 2026, 8:15 – 9:45 am, Talk Room 2
Session: 3D Shape and Space Perception

Tyler Bonnen1, Jitendra Malik1, Angjoo Kanazawa1; 1UC Berkeley

Humans can infer the three-dimensional structure of objects. Modeling this ability has been a longstanding goal for the science and engineering of visual intelligence, yet decades of computational methods have fallen short of human performance. Here we evaluate a novel class of neural networks that, for the first time, match human accuracy in 3D perception experiments. These models are trained on multi-view image sequences and corresponding self-motion cues: visual-spatial information analogous to the sensory inputs humans receive in natural environments. To evaluate this modeling approach, we leverage an existing 3D perception benchmark (MOCHI), which uses a concurrent visual discrimination ('oddity') task and reveals a considerable gap between humans and standard computer vision models. We develop a zero-shot evaluation approach to measure the performance of these multi-view models, then compare their choices with human (n=350) responses to the same images. These multi-view models match human-level 3D shape inferences, even without task-specific training or fine-tuning. Remarkably, independent model readouts predict human error patterns and reaction times, revealing an emergent correspondence between model dynamics and human perceptual processing. Our work introduces a modeling framework to formalize and evaluate theories of human visual perception, demonstrating that human-level 3D abilities emerge in neural networks trained with naturalistic visual-spatial data.
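As a concrete illustration of how a zero-shot oddity readout can work (a minimal sketch, not the authors' exact procedure), one can embed each image in a trial and select the image least similar to the rest. The use of cosine similarity over generic embeddings, and the random vectors standing in for model features, are assumptions made here for illustration.

```python
import numpy as np

def oddity_choice(embeddings: np.ndarray) -> int:
    """Predict the odd image out: the one least similar to the others.

    embeddings: (n_images, dim) array, one row per image in the trial.
    Returns the index of the predicted oddity.
    """
    # Normalize each embedding so dot products give cosine similarities
    # (the choice of similarity metric is an assumption).
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T                    # pairwise cosine similarities
    np.fill_diagonal(sim, 0.0)       # ignore self-similarity
    mean_sim = sim.mean(axis=1)      # average similarity to the other images
    return int(np.argmin(mean_sim))  # least similar image = predicted oddity

# Hypothetical usage: three views of one object plus one view of an odd
# object, embedded by some multi-view model (random vectors stand in here).
rng = np.random.default_rng(0)
trial_embeddings = rng.normal(size=(4, 512))
print(oddity_choice(trial_embeddings))
```

Because this readout requires no task-specific training, it is zero-shot in the sense the abstract describes: the model's existing representations are queried directly on the oddity trials.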

Acknowledgements: This work is supported by the National Institute of Neurological Disorders and Stroke of the National Institutes of Health (Award Number F99NS125816)