Characteristics of the emergence of category selectivity in convolutional neural networks

Poster Presentation 63.403: Wednesday, May 22, 2024, 8:30 am – 12:30 pm, Pavilion
Session: Object Recognition: Models

Niels J. Verosky1, Olivia S. Cheung1,2; 1New York University Abu Dhabi, 2Center for Brain & Health, NYUAD Research Institute

Convolutional neural networks (CNNs) have attained impressive performance on visual categorization. Are CNNs appropriate working models of the human visual system? We investigated whether CNN performance resembles human animacy categorization in four critical respects: 1) successful categorization of animals vs. objects independent of image statistics, 2) a continuum from perceptual to conceptual processing, 3) earlier emergence of animal than object representations, and 4) stable performance on altered images, such as images filtered to contain only high or low spatial frequencies. We tested ResNet-50, with either ImageNet pretraining or Contrastive Language-Image Pretraining (CLIP), on categorizing grayscale images of animals and objects that had either round or elongated overall shapes, such that all images with the same overall shape shared comparable image statistics. Each category contained 12-16 items (e.g., squirrel, dolphin), with 16 exemplars per item. Low-level visual properties were controlled using the SHINE toolbox. We measured the CNNs' categorization accuracy for animals vs. objects and used representational similarity analysis (RSA) to examine their internal representations: at each layer, the CNN representations of all items were compared with theoretical category-selective, shape-selective, animal-selective, and object-selective models. Consistent with human performance, we found that 1) both CNNs categorized the images with high accuracy (92-98%) in the absence of image-statistics differences across categories and formed category-selective representations towards the final layers, 2) shape-selective representations arose across the layers prior to category-selective representations, and 3) animal-selective representations emerged in early layers and remained stable across layers, whereas object-selective representations appeared only late. However, 4) CNN performance was dramatically impacted by spatial-frequency changes: categorization accuracy dropped substantially (53-80%) and the internal representations became highly shape-selective throughout the layers. These results suggest that CNNs show strong similarities to human categorization, but are limited in generalizing across spatial frequencies.
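The layer-wise RSA described above can be sketched as follows: a neural representational dissimilarity matrix (RDM) is built from one layer's activations and correlated with a theoretical model RDM. This is a minimal illustration, not the authors' actual analysis code; the function and variable names (`rsa_score`, `layer_acts`, `model`) and the choice of correlation distance with Spearman comparison are assumptions about a standard RSA pipeline.

```python
import numpy as np
from scipy.stats import spearmanr
from scipy.spatial.distance import pdist

def rsa_score(layer_acts, model_rdm_vec):
    """Correlate a layer's neural RDM with a theoretical model RDM.

    layer_acts: (n_items, n_features) activations from one CNN layer.
    model_rdm_vec: condensed (upper-triangle) model RDM, e.g. a
    category-selective model with 0 for same-category pairs, 1 otherwise.
    """
    # Neural RDM: pairwise correlation distance between item representations
    neural_rdm_vec = pdist(layer_acts, metric="correlation")
    # Rank correlation between neural and model dissimilarities
    rho, _ = spearmanr(neural_rdm_vec, model_rdm_vec)
    return rho

# Toy usage: 4 items, first two "animals", last two "objects"
rng = np.random.default_rng(0)
acts = rng.standard_normal((4, 10))
labels = np.array([0, 0, 1, 1])
# Category-selective model RDM: 0 within category, 1 between
model = pdist(labels[:, None], metric="hamming")
score = rsa_score(acts, model)
```

Running this per layer and per model (category-, shape-, animal-, and object-selective) yields the selectivity profiles across depth that the abstract reports.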
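The spatial-frequency manipulation can be illustrated with a simple Gaussian filter pair: a low-pass filter keeps only low spatial frequencies, and subtracting the low-pass result from the original leaves only high spatial frequencies. This is a generic sketch of the technique, not the study's actual filtering procedure; the cutoff (`sigma=4.0`) is a placeholder, not the value used in the experiments.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def low_pass(img, sigma=4.0):
    # Gaussian blur retains only low spatial frequencies
    return gaussian_filter(img.astype(float), sigma=sigma)

def high_pass(img, sigma=4.0):
    # Residual after removing the low frequencies: high spatial frequencies
    img = img.astype(float)
    return img - gaussian_filter(img, sigma=sigma)

# Toy grayscale image
img = np.random.default_rng(1).random((64, 64))
lsf = low_pass(img)   # low-spatial-frequency version
hsf = high_pass(img)  # high-spatial-frequency version
```

By construction the two filtered versions sum back to the original image, which is why each version in isolation removes information the network may rely on.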

Acknowledgements: This work was funded by a New York University Abu Dhabi faculty grant (AD174) and a Tamkeen New York University Abu Dhabi Research Institute grant (CG012).