Task-Driven Recurrent Demands Reduce ANN Alignment with Primate IT
Poster Presentation 56.422: Tuesday, May 19, 2026, 2:45 – 6:45 pm, Pavilion
Session: Object Recognition: Models
Ezgi Fide1 (ezgifide@yorku.ca), Sabine Muzellec2, R. Shayna Rosenbaum1, Kohitij Kar2; 1York University, Department of Psychology and Centre for Vision Research, Centre for Integrative and Applied Neuroscience, Toronto, Canada, 2York University, Department of Biology and Centre for Vision Research, Centre for Integrative and Applied Neuroscience, Toronto, Canada
Are we overlooking crucial insights by evaluating current artificial neural network (ANN) models of primate vision primarily on neural responses recorded during passive stimulus viewing? Much of the field's recent progress relies on widely used benchmarks, such as Brain-Score, the Natural Scenes Dataset, and THINGS, which assess model-brain similarity using data collected while subjects passively view images. Yet real-world vision is goal-directed, suggesting that current benchmarking practices may overlook critical task-dependent computations. To address this gap, we recorded neural activity from the inferior temporal (IT) cortex of two macaque monkeys (n=155 and 109 reliable sites) during both passive viewing and an active object discrimination task, using the same set of 1320 images. We then compared the accuracy of 38 state-of-the-art ANNs in predicting IT responses across task conditions. Given these models' optimization for object recognition, we expected neural predictivity to be higher during active task engagement. However, cross-predictivity, both ANNs predicting IT (forward predictivity, FP) and IT predicting ANNs (reverse predictivity, RP), was modest but significantly higher during passive than active viewing (FP: t(37)=15.8, p<0.001; RP: t(37)=-0.196, p=0.577). This reduction in predictivity was substantially greater for late-phase (170-200 ms) than for early-phase (70-100 ms) responses (FP: t(37)=61.6, p<0.001; RP: t(37)=15.1, p<0.001), consistent with the models' known limitations in capturing recurrent computations. In contrast, IT responses measured during active task engagement better predicted behavioral accuracy patterns across images. Moreover, transforming ANN activations using the learned mapping between passive and active IT responses significantly improved the ANNs' behavioral predictivity, suggesting that ANN representations lack task-dependent components present in IT.
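The abstract does not specify how forward and reverse predictivity were computed; a common approach in this literature is cross-validated linear regression from one representation to the other, scored as the correlation between held-out predictions and measured responses. The minimal sketch below illustrates that idea with closed-form ridge regression on synthetic data; the array shapes, regularization value, and fold count are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np

def ridge_fit(X, Y, alpha=1.0):
    """Closed-form ridge regression: solve (X^T X + alpha*I) W = X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def predictivity(X, Y, alpha=1.0, n_splits=2, seed=0):
    """Mean per-target Pearson r between held-out predictions and targets."""
    idx = np.arange(X.shape[0])
    np.random.default_rng(seed).shuffle(idx)
    folds = np.array_split(idx, n_splits)
    rs = []
    for k in range(n_splits):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        W = ridge_fit(X[train], Y[train], alpha)
        pred = X[test] @ W
        for j in range(Y.shape[1]):
            rs.append(np.corrcoef(pred[:, j], Y[test, j])[0, 1])
    return float(np.mean(rs))

# Toy example: 1320 "images", hypothetical ANN features and IT sites.
rng = np.random.default_rng(1)
feats = rng.standard_normal((1320, 50))  # assumed ANN activation matrix
it = feats @ rng.standard_normal((50, 20)) * 0.1 \
     + rng.standard_normal((1320, 20))   # simulated IT sites: signal + noise

fp = predictivity(feats, it)  # forward predictivity: ANN -> IT
rp = predictivity(it, feats)  # reverse predictivity: IT -> ANN
print(fp, rp)
```

Under this scheme, comparing FP computed separately on passive-viewing and active-task response matrices (and likewise for RP) yields the paired per-model scores that the reported t-tests across the 38 ANNs would operate on.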
The modest discrepancies between active and passive conditions affirm the value of large-scale passive datasets, which capture core feedforward IT structure for high-throughput benchmarking. Rather than rejecting this approach, our results highlight task-related gaps and the need to model task-driven recurrent computations in next-generation vision models.
Acknowledgements: KK is supported by funds from the Canada Research Chair Program (CRC-2021-00326), NSERC (RGPIN-2024-06223). EF is supported by the TBS Grant (York University). RSR acknowledges support from a York Research Chair, and NSERC. The CFREF VISTA and Connected Minds programs support KK, EF, SM, and RSR.