A benchmark dataset for perceiving object spatial relationships in human and machine vision
Poster Presentation 23.341: Saturday, May 16, 2026, 8:30 am – 12:30 pm, Banyan Breezeway
Session: Scene Perception: Intuitive physics
Georgina Woo¹, Nancy Kanwisher¹, RT Pramod¹; ¹MIT
Consistent with their importance for scene understanding and prediction, humans perceive physical relationships between objects, such as support, containment, and attachment, automatically and abstractly across changes in object shape, viewpoint, and surface properties (Hafri et al., 2024). How do we do this? Here, we introduce a new benchmark dataset designed to test various computational models as hypotheses for how humans perceive object relationships. The dataset contains images representing five spatial relationships (support, contain, attach, occlude, and cover), each instantiated across variations in object identities, materials, viewpoints, and backgrounds. We measured human behavior (N = 40) on a matching task in which participants chose which of two alternatives (‘target’ and ‘lure’) depicted the same object spatial relationship as a ‘sample’ image. In all, we collected human behavior on 320 trials of this matching task, in which the sample image additionally differed from the target and lure images with respect to object shapes, viewpoints, and backgrounds. As expected, humans were highly reliable and accurate in their judgments (mean ± SD accuracy = 97% ± 1.3%). We then evaluated various pre-trained models on the same task, including CNNs, vision transformers, and next-frame-prediction systems such as V-JEPA, as candidate computational hypotheses for human perception of object spatial relationships. All tested models performed the task poorly, with the best model (ViT-B trained on ImageNet) reaching an accuracy of only 67.5% (chance = 50%). In ongoing work, we are (i) expanding the benchmark to include dynamic video stimuli and (ii) testing whether end-to-end training specifically on object relationships will suffice, or whether these tasks can only be performed via explicit modeling of the 3D structure of the scene.
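For a concrete picture of how a pretrained model could be scored on one matching trial, the Python sketch below illustrates one plausible readout (it is not the authors' code): embed the sample, target, and lure images with an ImageNet-pretrained ViT-B and count the trial correct if the sample's embedding is closer to the target's than to the lure's. The choice of torchvision's ViT-B/16 weights, the similarity readout, and the file names are all illustrative assumptions rather than details from the abstract.

import torch
import torch.nn.functional as F
from PIL import Image
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load an ImageNet-pretrained ViT-B/16 and strip the classifier so the model
# returns the 768-d class-token representation (an assumed feature readout).
weights = ViT_B_16_Weights.IMAGENET1K_V1
model = vit_b_16(weights=weights).eval()
model.heads = torch.nn.Identity()
preprocess = weights.transforms()

def embed(path):
    # Preprocess one image and return its L2-normalized embedding.
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feats = model(img)
    return F.normalize(feats, dim=-1).squeeze(0)

def run_trial(sample_path, target_path, lure_path):
    # Score the trial correct if the sample (which depicts the same spatial
    # relationship as the target) is more similar to the target than to the
    # lure under cosine similarity; chance performance is 50%.
    s = embed(sample_path)
    t = embed(target_path)
    l = embed(lure_path)
    return torch.dot(s, t).item() > torch.dot(s, l).item()

# Hypothetical file names for one 'support' trial with a 'contain' lure:
# correct = run_trial("sample_support.png", "target_support.png", "lure_contain.png")

Averaging such trial scores over a trial set would give a model accuracy directly comparable to the human 2AFC accuracy reported above, under the stated assumptions about the feature readout.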
Acknowledgements: NSF STC Award Number 2124136 to NK