Visual angle and image context alter the alignment between deep convolutional neural networks and the macaque ventral stream

Poster Presentation 43.312: Monday, May 22, 2023, 8:30 am – 12:30 pm, Banyan Breezeway
Session: Object Recognition: Models

Sara Djambazovska1,2, Gabriel Kreiman2, Kohitij Kar3; 1Swiss Federal Institute of Technology, Lausanne (EPFL), 2Harvard Medical School, 3York University

A family of deep convolutional neural networks (DCNNs) currently best explains the primate ventral stream activity that supports object recognition. Such models are often evaluated with neurobehavioral datasets in which the stimuli are presented in the subjects’ central field of view (FOV). However, the exact visual angle varies widely across studies (e.g., 8 degrees in Yamins et al., 2014; 2.9 degrees in Khaligh-Razavi et al., 2014; 2 degrees, catered to V1 neuronal receptive fields, in Cadena et al., 2019). A unified model of the primate visual system cannot have a varying FOV. Similarly, the type of images used for model evaluation varies across studies, ranging from objects embedded in randomized contexts (Yamins et al., 2014) to objects with no context (Khaligh-Razavi et al., 2014). Here we systematically tested how the DCNN predictivity of macaque inferior temporal (IT) neurons depends on the FOV and the image context. We used images (“full-context”) from the Microsoft COCO image set. We performed large-scale recordings in one macaque (~100 IT sites) while the monkey passively fixated images presented at 20 and 30 degrees. To estimate the optimal FOV for the DCNNs, we compared DCNN IT predictivity at varying image crop sizes. We observed that crops of ~8-10 visual degrees produced the strongest DCNN IT predictions. Next, to test the effect of image context, we generated two versions of the original images: object only (“no-context”) and swapped backgrounds (“incongruent-context”). DCNN IT predictivity was significantly lower for “incongruent-context” than for “no-context” and “full-context” images. Interestingly, we observed a larger gap between early (90-120 ms) and late (150-180 ms) response predictivity for “incongruent-context” than for “no/full-context” images, suggesting stronger putative feedback signals during such contextual manipulations.
In sum, our results provide critical constraints to guide the development of more brain-aligned DCNN models of primate vision.
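As an illustrative sketch only (not the authors' actual pipeline, which the abstract does not specify), the crop-size analysis could be implemented as: for each candidate FOV, extract DCNN features from center-cropped images and fit a cross-validated linear map to IT responses, scoring predictivity as the mean held-out Pearson correlation across sites. The ridge-regression metric, the crop sizes, and all data below are placeholder assumptions.

```python
# Hypothetical sketch of a crop-size predictivity analysis. All features and
# responses here are random placeholders, not real DCNN activations or IT data.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n_images, n_features, n_sites = 200, 100, 50

def it_predictivity(features, responses, n_splits=5):
    """Mean cross-validated Pearson r between predicted and held-out responses."""
    preds = np.zeros_like(responses)
    for train, test in KFold(n_splits, shuffle=True, random_state=0).split(features):
        model = Ridge(alpha=1.0).fit(features[train], responses[train])
        preds[test] = model.predict(features[test])
    rs = [np.corrcoef(preds[:, s], responses[:, s])[0, 1] for s in range(responses.shape[1])]
    return float(np.mean(rs))

# Simulate one feature matrix per candidate crop size (in visual degrees), and
# construct IT responses as a noisy linear readout of the 8-degree features,
# so the 8-degree crop is best-predicting by construction.
crop_degrees = [2, 4, 8, 10, 20, 30]
features_by_crop = {d: rng.standard_normal((n_images, n_features)) for d in crop_degrees}
W = rng.standard_normal((n_features, n_sites))
responses = features_by_crop[8] @ W + 0.5 * rng.standard_normal((n_images, n_sites))

scores = {d: it_predictivity(f, responses) for d, f in features_by_crop.items()}
best_crop = max(scores, key=scores.get)
```

In a real analysis, `features_by_crop` would hold DCNN activations for images cropped to each visual angle, and the crop size maximizing cross-validated predictivity would be taken as the model's optimal FOV.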

Acknowledgements: KK was supported by the Canada Research Chair Program. This research was undertaken thanks in part to funding from the Canada First Research Excellence Fund. SD was supported by the Bertarelli Foundation. We also thank Jim DiCarlo and Sarah Goulding for their support.