Unveiling task-dependent action affordance representations: Insights from scene-selective cortex and deep neural networks

Poster Presentation 26.471: Saturday, May 18, 2024, 2:45 – 6:45 pm, Pavilion
Session: Scene Perception: Neural mechanisms

Clemens G. Bartnik1, Nikolina Vukšić2, Steven Bommer2, Iris I.A. Groen1,2; 1Video & Image Sense Lab, Informatics Institute, University of Amsterdam, The Netherlands, 2Psychology Research Institute, University of Amsterdam, Amsterdam, The Netherlands

Humans effortlessly know how and where to move in their immediate environment using a wide range of navigational actions, from walking and driving to climbing. Yet little is known about where and how action affordances are computed in the brain. Some work implicates scene-selective cortex in navigational affordance representation, reflecting visual features computed in mid-level DNN layers (Bonner et al., 2017, 2018), while other work reports a lack of affordance representation in these regions (Groen et al., 2018). Here, we curated a novel set of real-world scenes that afford distinct navigational actions in both indoor and outdoor environments, for which we collected rich behavioral annotations (N=152) for seven commonly used visual properties. The behavioral annotations indicate that navigational actions form a distinct space separate from representations of objects or materials; even in combination, visual properties explain only around 20% of the variance in navigational action annotations. We collected human fMRI responses (N=20) to a subset of 90 images while subjects performed three distinct tasks (action affordance recognition, object recognition, and fixation). Using representational similarity analysis, we confirm that scene-selective brain regions, especially the Parahippocampal Place Area and Occipital Place Area, represent navigational action affordances. Furthermore, elevated behavioral correlations in scene-selective regions during the action affordance and object recognition tasks relative to fixation suggest these representations are task-dependent. In contrast to prior findings, however, we find that DNNs trained for scene and object classification represent these action affordances poorly. Interestingly, language-supervised models such as Contrastive Language-Image Pre-training (CLIP) show improved predictions of both behavior and brain activity, suggesting they better capture affordance representations. These findings strengthen evidence for action affordances in scene-selective cortex and reveal their task dependency. While the underlying computations remain elusive, our work suggests that integrating semantic information into computational models of affordance perception is a promising direction.
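
As a rough illustration of the representational similarity analysis mentioned above, the sketch below compares the representational geometry of a brain region with that of a feature space (e.g., affordance annotations or DNN activations). The variable names, placeholder data, correlation-distance metric, and Spearman comparison are assumptions for illustration only, not the authors' actual analysis pipeline.

```python
# Minimal RSA sketch with hypothetical inputs; not the authors' pipeline.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(patterns):
    """Condensed representational dissimilarity matrix (1 - Pearson r)
    computed from an (n_images, n_features) response matrix."""
    return pdist(patterns, metric="correlation")

# Placeholder data: 90 images, e.g. voxel responses from a scene-selective
# ROI and a behavioral/model feature space (random numbers stand in here).
rng = np.random.default_rng(0)
brain_responses = rng.standard_normal((90, 500))   # e.g. PPA voxel patterns
model_features = rng.standard_normal((90, 64))     # e.g. affordance annotations

# Compare the two representational geometries with a Spearman correlation
# between the off-diagonal entries of their RDMs.
rho, p = spearmanr(rdm(brain_responses), rdm(model_features))
print(f"RSA correlation: rho={rho:.3f}, p={p:.3f}")
```

In practice, such model-brain correlations would be computed per subject and region of interest and then compared across tasks (affordance recognition, object recognition, fixation) to assess task dependence.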