Modeling shape-from-motion processing in the primate dorsoventral visual pathways with video-computable neural networks

Poster Presentation 33.427: Sunday, May 17, 2026, 8:30 am – 12:30 pm, Pavilion
Session: 3D Shape and Space Perception: Miscellaneous

Yoon Bai1 (yhb@mit.edu), Thomas O'Connell2, Ani Ayvazian-Hancock4, Hannah Maver1, Yoni Friedman3, Josh Tenenbaum2,3, James DiCarlo1,2; 1MIT McGovern Institute for Brain Research, 2MIT Department of Brain and Cognitive Sciences, 3MIT Computer Science & Artificial Intelligence Laboratory (CSAIL), 4MIT Division of Comparative Medicine

Perceiving shape from dynamic sensory data is a fundamental challenge for the visual system and is essential for robust interaction with a changing world. How does the primate visual system compute object shape from motion cues? We investigated this question using textured objects, human behavioral experiments, and dorsoventral neural activity recorded from macaque monkeys. Stimuli (2520 videos, 400 ms each) were rendered from abstract 3D shapes with a 50° rotation. Both the object and the background were overlaid with one of five high-contrast textures, camouflaging the object shape in the absence of motion. Behavioral experiments show that humans are near ceiling at matching motion-induced shape across textures. We used chronic multi-electrode arrays to record from three downstream regions in the ventral pathway (pIT, cIT, aIT) and two downstream regions in the dorsal pathway (7op, Tpt). Neural decoding revealed that motion-induced shape is represented in both the ventral and dorsal visual streams, but shape readouts from dorsal regions correlate more strongly with human behavior than shape readouts from ventral regions. To identify candidate computations that may underlie these capabilities, we evaluated a large set (N=1280) of video-computable neural network models as computational analogs of primate visual processing. These include ImageNet CNNs and ViTs, unsupervised image models, foundation image models, video classification models, and autoregressive video models. Autoregressive video models trained to predict missing spatiotemporal features in natural videos exhibit superior alignment with both ventral and dorsal neural responses and with human behavior. Finally, we find that the models that best predict neural responses are also the best at behaviorally matching shapes across textures.
Taken together, our results show that neuronal activity in downstream dorsal regions contains motion-induced shape representations that are closely correlated with human behavior, and they identify predictive processing of spatiotemporal regularities as a plausible computational motif for how the brain constructs geometric world models from motion cues.

Acknowledgements: Office of Naval Research (MURI N00014-21-1-2801)