3D Shape and Space Perception
Talk Session: Monday, May 18, 2026, 8:15 – 9:45 am, Talk Room 2
Moderator: Michele Rucci, University of Rochester
Talk 1, 8:15 am, 41.21
Interocular suppression weakens binocular disparity integration in the dorsal visual stream
Rong Jiang1, Ming Meng2; 1School of Psychology, South China Normal University, Guangzhou, China, 2Department of Neurobiology, The University of Alabama at Birmingham, Birmingham, USA
Binocular vision includes two complementary processes: integrating binocular disparity signals and suppressing interocular conflicts. Behavioral studies indicate that these processes influence each other, but the underlying neural mechanisms remain unclear. Here, we used fMRI and multivariate decoding to investigate how interocular suppression affects disparity representations in the human dorsal visual stream. Participants performed a central fixation task in the scanner. We manipulated the central visual field (eccentricity 0.4°–2°) to induce binocular fusion or rivalry, and measured neural responses to near vs. far disparities in correlated or anticorrelated random-dot stereograms in the peripheral region (eccentricity 2.5°–3.5°), yielding four conditions: correlated fusion (CF), correlated rivalry (CR), anticorrelated fusion (AF), and anticorrelated rivalry (AR). Within-condition decoding revealed robust near–far discrimination for CF across V1–V3, V3A, and IPS. Rivalry (CR) significantly reduced decoding accuracy in V3, V3A, and IPS, while decoding for anticorrelated conditions (AF, AR) dropped to chance. Cross-condition decoding showed that near–far decoding accuracy between CF and CR was significantly above chance level, suggesting that rivalry weakens, but does not remap, disparity representations. In V3A, cross-decoding accuracy between CF and AF was significantly below chance, indicating an inverted disparity code. This inverted disparity code was also disrupted by the interocular suppression process, as cross-decoding accuracy between CF and AR returned to chance level. Informational connectivity analyses revealed strong coordination among ROIs under CF. This coordination was significantly reduced within the V3A-centered network during CR, while connectivity within early visual areas (V1–V3) remained unchanged. Together, these results not only characterize how interocular suppression weakens disparity-selective signals and their large-scale coordination in the dorsal visual stream of typical human observers, but also provide insights that may guide future research on the neural mechanisms underlying stereoscopic deficits in binocular disorders.
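A minimal sketch of the within- and cross-condition decoding logic described above, using scikit-learn on synthetic data; the array names, trial counts, and classifier choice are illustrative assumptions, not details from the study.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_trials, n_voxels = 80, 200
# Hypothetical per-trial ROI voxel patterns for two of the four conditions.
patterns_cf = rng.normal(size=(n_trials, n_voxels))  # correlated fusion (CF)
patterns_cr = rng.normal(size=(n_trials, n_voxels))  # correlated rivalry (CR)
labels_cf = rng.integers(0, 2, size=n_trials)        # 0 = near, 1 = far
labels_cr = rng.integers(0, 2, size=n_trials)

# Within-condition decoding: cross-validated near vs. far discrimination in CF.
within_cf = cross_val_score(LinearSVC(), patterns_cf, labels_cf, cv=5).mean()

# Cross-condition decoding: train on CF, test on CR. Above-chance transfer,
# as reported above, implies rivalry weakens rather than remaps the code.
clf = LinearSVC().fit(patterns_cf, labels_cf)
cross_cf_cr = clf.score(patterns_cr, labels_cr)

print(f"within-CF: {within_cf:.2f}  CF->CR transfer: {cross_cf_cr:.2f}")
```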
Talk 2, 8:30 am, 41.22
The role of visuomotor contingency in stereoscopic depth perception
Jie Z. Wang1, Y. Howard Li1, Jonathan D. Victor2, Michele Rucci1; 1University of Rochester, 2Weill Cornell Medical College
Eye movements occur continually during visual fixation and profoundly influence visual sensitivity. Yet their functions in 3-D visual processing have been little explored. Recent work has shown that the incessant inter-saccadic motion of the two eyes introduces temporal modulations in disparity signals that the visual system exploits to establish 3-D representations. Here we show that, in addition to the presence of disparity modulations, an important component of this process is the alignment between incoming disparity changes and motor expectations from eye movements. In a forced-choice procedure, observers reported whether the upper or lower half of a vertically slanted planar surface appeared closer. Stimuli were random-dot stereograms viewed through a stereoscope. Eye movements were continually measured with a digital DPI eye-tracker, and a custom apparatus for gaze-contingent display updated stimuli in real time to allow for disruption of the natural visuomotor contingency. Trials alternated among three conditions. In the “Stabilized” condition, disparity modulations were eliminated by moving the stimuli with the eyes. In the “Reconstructed” condition, stimuli moved to counteract ongoing eye movements and recreate the retinal motion generated by previously recorded eye traces, yielding disparity modulations largely uncorrelated with current eye movements. In the “Flipped-vergence” condition, the sign of vergence-induced disparity was inverted and version-induced disparity was eliminated. Performance was greatly impaired under retinal stabilization, replicating previous findings. Discrimination improved in the Reconstructed condition, demonstrating that passive exposure to disparity modulations is beneficial, but does not suffice to support normal sensitivity. Strikingly, performance in the Flipped-vergence condition was even more impaired than under stabilization. These results indicate that stereopsis relies not only on dynamic disparity signals, but also on their congruence with ongoing oculomotor signals. These findings point to a critical role of visuomotor contingency in depth perception.
Supported by NIH EY18363 and P30 EY001319
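A schematic of how the three gaze-contingent conditions could be implemented, in simplified one-dimensional (horizontal, small-angle) geometry; the function names and sign conventions are our own illustrative assumptions, not the authors' apparatus code.

```python
def version_vergence(eye_l: float, eye_r: float) -> tuple[float, float]:
    """Decompose horizontal binocular gaze (deg) into version and vergence."""
    return (eye_l + eye_r) / 2.0, eye_l - eye_r

def stimulus_shift(condition: str, eye_l: float, eye_r: float,
                   replayed_l: float, replayed_r: float) -> tuple[float, float]:
    """Per-frame horizontal shift applied to the (left, right) half-images.

    With zero shift, the retinal disparity modulation equals minus the
    vergence change; each condition manipulates that relation.
    """
    if condition == "stabilized":
        # Move each half-image with its eye: disparity modulations vanish.
        return eye_l, eye_r
    if condition == "reconstructed":
        # Cancel the current eye movement and re-impose a prerecorded trace
        # (replayed_l, replayed_r), yielding disparity modulations largely
        # uncorrelated with the observer's current eye movements.
        return eye_l - replayed_l, eye_r - replayed_r
    if condition == "flipped_vergence":
        version, vergence = version_vergence(eye_l, eye_r)
        # Null the version-driven retinal motion while inverting the sign of
        # the vergence-induced disparity modulation.
        return version + vergence, version - vergence
    raise ValueError(condition)
```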
Talk 3, 8:45 am, 41.23
Benchmarking Human and DNN Biases in Monocular Depth Estimation
Yuki Kubota1, Taiki Fukiage1; 1Communication Science Laboratories, NTT, Inc.
Human depth perception from a single image is systematically biased, yet the characteristics of these distortions remain insufficiently understood. Meanwhile, modern monocular depth estimation (MDE) models achieve high physical accuracy, raising a central question: to what extent do such models reproduce—or diverge from—human perceptual biases? To address this, we constructed two human-annotated depth datasets using established benchmarks: NYU (indoor scenes) and KITTI (outdoor scenes). These datasets enabled direct comparisons between human observers and 69 deep neural networks (DNNs), spanning diverse architectures, training strategies, datasets, and output formats. Human data were obtained by asking participants to report the absolute distances to four simultaneously marked target points in each photograph. Model accuracy was quantified using scale-invariant RMSE. Human–model similarity was defined as a partial correlation between model and human error patterns, obtained by repeatedly correlating fixed model errors with split-half averages of observer errors while controlling for ground-truth depth. We further applied an affine decomposition that isolates per-image affine distortions (scale, shift, horizontal shear, and vertical shear) from residual error. Across both datasets, humans showed robust and systematic deviations from physical ground truth, as indicated by high split-half human–human partial correlations of error patterns (NYU: 0.808; KITTI: 0.671). Examining the relationship between accuracy and human similarity revealed a clear pattern: similarity increased with accuracy up to approximately human-level performance, but declined for models surpassing that range—indicating a distinct accuracy–similarity trade-off. Notably, this trade-off was substantially more pronounced in the KITTI dataset. Overall, our findings demonstrate that human-like behavior in MDE does not emerge simply by improving metric accuracy. Instead, the divergence suggests that the strategies used by state-of-the-art DNNs, which potentially rely on dataset-specific cues to maximize precision, fundamentally differ from the generalized perceptual heuristics employed by human observers.
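For reference, scale-invariant RMSE in the sense of Eigen et al. (2014) can be computed as below; whether the authors used exactly this log-space variant is an assumption.

```python
import numpy as np

def si_rmse(pred: np.ndarray, gt: np.ndarray) -> float:
    """Scale-invariant RMSE in log-depth space: the standard deviation of the
    log-depth errors, which is insensitive to a global multiplicative scale
    on the predictions."""
    d = np.log(pred) - np.log(gt)
    return float(np.sqrt(np.mean(d ** 2) - np.mean(d) ** 2))

# A global rescaling of the predicted depths leaves the score unchanged:
gt = np.array([1.5, 2.0, 3.2, 4.1])
pred = np.array([1.2, 2.4, 3.0, 4.8])
assert np.isclose(si_rmse(pred, gt), si_rmse(10.0 * pred, gt))
```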
Talk 4, 9:00 am, 41.24
Final vision points you to the right (or wrong) place: a two-mode framework for depth-guided action
Carlo Campagnoli1, Fulvio Domini2; 1University of Leeds, 2Brown University
We present a re-analysis of a previous reaching study showing that the classic distinction between planning and online control does not reflect a shift from internal representation to sensory feedback, but rather a context-dependent selection between two environmentally constrained control modalities. Even when behaviour appears to express an internal estimate of target depth, the motor outcome remains determined by the constraints available at the moment the movement must be completed, not by the accuracy or persistence of any internal representation. And, critically, when such a representation is forced to operate, it is systematically wrong. In a first experiment, visual feedback was removed shortly before movement completion. Under these conditions, where no late correction was possible, endpoints reflected only the quality of the sensory information available earlier in the reach. Binocular viewing produced near-veridical performance, whereas monocular viewing generated a consistent overshoot. This pattern does not reveal a privileged “internal model” guiding action. Instead, it reveals a regime in which, absent usable feedback, the system is constrained to act on an impoverished and biased mapping of depth. The second experiment preserved visual feedback until the end and manipulated whether the final segment provided binocular or monocular information. Here, the behaviour previously attributed to an internal depth estimate vanished entirely. Instead, endpoints were dictated solely by the final visual configuration. Switching from monocular to binocular vision produced accurate reaches regardless of earlier viewing, whereas switching from binocular to monocular vision produced overshooting despite prolonged access to reliable disparity. In other words, the system did not rely on a stored internal depth estimate: it replaced that estimate whenever environmental information permitted. Together, these findings challenge interpretations based on stable internal representations guiding action. Reaching behaviour appears to reflect contextual dominance of environmental constraints, not the fidelity of an internal depth estimate (accurate or otherwise).
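A toy decision rule capturing our reading of the two-mode framework; this is an illustrative paraphrase of the selection logic, not the authors' model, and all names are hypothetical.

```python
from typing import Optional

def reach_endpoint(final_view: Optional[str], early_view: str) -> str:
    """Which depth mapping drives the reach endpoint.

    'binocular' yields near-veridical reaches; 'monocular' yields overshoot.
    final_view is None when feedback is removed before movement completion.
    """
    if final_view is not None:
        # Feedback mode: the final visual configuration dictates the endpoint,
        # overriding earlier information of either kind.
        return final_view
    # No-feedback mode: fall back on the mapping formed earlier in the reach.
    return early_view

# Experiment 1 (feedback removed): endpoints reflect earlier viewing.
assert reach_endpoint(None, "monocular") == "monocular"         # overshoot
# Experiment 2 (feedback preserved): the final view dominates regardless.
assert reach_endpoint("binocular", "monocular") == "binocular"  # accurate
assert reach_endpoint("monocular", "binocular") == "monocular"  # overshoot
```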
Talk 5, 9:15 am, 41.25
Human-level 3D shape perception emerges from multi-view learning
Tyler Bonnen1, Jitendra Malik1, Angjoo Kanazawa1; 1UC Berkeley
Humans can infer the three-dimensional structure of objects. Modeling this ability has been a longstanding goal for the science and engineering of visual intelligence, yet decades of computational methods have fallen short of human performance. Here we evaluate a novel class of neural networks that, for the first time, match human accuracy in 3D perception experiments. These models are trained with multi-view image sequences and corresponding self-motion cues—visual-spatial information analogous to human sensory inputs in natural environments. To evaluate this novel modeling approach, we leverage an existing 3D perception benchmark (MOCHI), which reveals a considerable gap between humans and standard computer vision models on a concurrent visual discrimination ('oddity') task. We determine the performance of these multi-view models by developing a zero-shot evaluation approach, then compare model choices to human (n=350) responses to the same images. These multi-view models match human-level 3D shape inferences, even without task-specific training or fine-tuning. Remarkably, independent model readouts predict human error patterns and reaction times, revealing an emergent correspondence between model dynamics and human perceptual processing. Our work introduces a novel modeling framework to formalize and evaluate theories of human visual perception, demonstrating that human-level 3D abilities emerge in neural networks trained with naturalistic visual-spatial data.
This work is supported by the National Institute of Neurological Disorders and Stroke of the National Institutes of Health (Award Number F99NS125816)
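A plausible zero-shot oddity readout, sketched with placeholder embeddings: embed the three images with a frozen model and pick the one least similar to the other two. The abstract does not specify the readout, so the similarity metric and decision rule here are assumptions.

```python
import numpy as np

def oddity_choice(embeddings: np.ndarray) -> int:
    """embeddings: (3, d) array, one row per image. Returns the odd one's index."""
    # Cosine similarity between every pair of images.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T
    np.fill_diagonal(sim, 0.0)
    # Each image's summed similarity to the other two measures how "typical"
    # it is; the odd image minimizes it.
    return int(np.argmin(sim.sum(axis=1)))

# Two similar rows and one dissimilar row: the third image is the oddity.
emb = np.array([[1.0, 0.1], [0.9, 0.2], [-0.8, 1.0]])
assert oddity_choice(emb) == 2
```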
Talk 6, 9:30 am, 41.26
MRD: Using physically based differentiable rendering to probe vision models for 3D scene understanding
Benjamin Beilharz1, Thomas S.A. Wallis2; 1Technical University of Darmstadt, 2Center for Mind, Brain and Behavior (CMBB), Universities of Marburg, Giessen, and Darmstadt
While deep learning methods have achieved impressive success in many vision benchmarks, it remains difficult to understand and explain the representations and decisions of these models. Though vision models are typically trained on 2D inputs, they are often assumed to develop an implicit representation of the underlying 3D scene (for example, showing tolerance to partial occlusion, or the ability to reason about relative depth). Here, we introduce MRD (metamers rendered differentiably), an approach that uses physically based differentiable rendering to probe vision models’ implicit understanding of generative 3D scene properties, by finding 3D scene parameters that are physically different but produce the same model activation (i.e., are model metamers). Unlike previous pixel-based methods for evaluating model representations, these reconstruction results are always grounded in physical scene descriptions. This means we can, for example, probe a model's sensitivity to object shape while holding material and lighting constant. We assess multiple models in their ability to recover the scene parameters of geometry (shape) and bidirectional reflectance distribution function (material). Optimized scenes closely match targets in model activation, although their visual appearance varies. Qualitatively, these reconstructions can make clear the physical scene attributes that models are sensitive or invariant to. MRD holds promise for advancing our understanding of both computer and human vision, enabling us to efficiently answer the question of how physical scene parameters cause changes in model responses.
Funded by the European Union (ERC, SEGMENT, 101086774). Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them.
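The core metamer-search loop can be sketched in PyTorch as follows; render() stands in for a physically based differentiable renderer (Mitsuba 3 exposes one, for instance), and the parameterization and loss are assumptions, so this is a schematic of the optimization, not the MRD implementation.

```python
import torch

def find_model_metamer(model, render, theta_target, theta_init,
                       steps=500, lr=1e-2):
    """Optimize scene parameters until the model's activations match those of
    the target scene, while the physical parameters remain free to diverge."""
    with torch.no_grad():
        target_act = model(render(theta_target))  # activations to be matched
    theta = theta_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Gradients flow through both the model and the differentiable renderer.
        loss = torch.nn.functional.mse_loss(model(render(theta)), target_act)
        loss.backward()
        opt.step()
    # A physically different scene with (near-)identical model activations.
    return theta.detach()
```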