VSS, May 13-18

Object Recognition: Features, categories, preferences

Talk Session: Wednesday, May 18, 2022, 8:15 – 10:00 am EDT, Talk Room 2
Moderator: Arash Afraz, NIMH


Talk 1, 8:15 am, 61.21

Perceptual anisotropies across the central fovea

Samantha Jenks1, Martina Poletti1,2,3; 1Department of Brain and Cognitive Sciences, University of Rochester, Rochester, NY, USA, 2Department of Neuroscience, University of Rochester, Rochester, NY, USA, 3Center for Visual Science, University of Rochester, Rochester, NY, USA

It’s well established that vision in the periphery and parafovea is characterized by asymmetries: humans are better at discriminating items along the horizontal meridian than along the vertical meridian. Similarly, sensitivity in the lower visual field is better than in the upper visual field. Current evidence shows that the extent of these asymmetries decreases with decreasing eccentricity, suggesting that they may be absent in the central 1-deg fovea. However, due to technical limitations, this has never been examined. Thanks to high-precision eyetracking and a gaze-contingent display control allowing for more accurate localization of gaze, we probed fine visual discrimination at different isoeccentric locations across the foveola and compared it with corresponding locations in the periphery. Participants (n=10) performed a two-alternative forced-choice discrimination task while maintaining fixation on a central marker. Performance was tested at 8 locations, approximately 20 arcmin from the preferred locus of fixation. The same task was replicated at 4.5 degrees eccentricity (n=7), with stimulus size adjusted to account for cortical magnification. Our results show that, similarly to what happens in the visual periphery, humans are more sensitive to stimuli presented along the horizontal than the vertical foveal meridian. While this horizontal-vertical asymmetry is smaller in the fovea than extrafoveally, the asymmetry along the vertical meridian is of equal magnitude at both scales. Furthermore, the foveal asymmetry on this meridian is flipped compared to what is found extrafoveally: stimuli along the upper vertical meridian are discerned more easily than those along the lower vertical meridian. These findings show that even foveal vision is characterized by perceptual anisotropies and that their pattern is in part different from what is found in the rest of the visual field. Furthermore, while some asymmetries are larger extrafoveally, others are present to the same extent at both scales.
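To illustrate the kind of comparison described above, the following minimal Python sketch contrasts 2AFC sensitivity (d') at horizontal- versus vertical-meridian test locations. The trial data, location labels, and response rates are simulated placeholders, not the authors' analysis code.

# Illustrative sketch only: simulated 2AFC trials at 8 isoeccentric test
# locations, grouped by meridian to compare discrimination sensitivity (d').
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_trials = 4000
location = rng.integers(0, 8, n_trials)            # 8 locations around fixation (assumed labels)
is_a = rng.integers(0, 2, n_trials).astype(bool)   # which 2AFC alternative was shown
correct = rng.random(n_trials) < 0.75              # placeholder responses

def dprime(hit_rate, fa_rate):
    # d' from hit and false-alarm rates, clipped to avoid infinities
    hit_rate, fa_rate = np.clip([hit_rate, fa_rate], 0.01, 0.99)
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

def meridian_dprime(locs):
    # pool trials at the given locations and compute d'
    sel = np.isin(location, locs)
    hit = correct[sel & is_a].mean()         # "A" shown, answered correctly
    fa = 1 - correct[sel & ~is_a].mean()     # "B" shown, answered "A"
    return dprime(hit, fa)

# Assume locations 0/4 lie on the horizontal meridian and 2/6 on the vertical.
print("horizontal meridian d':", meridian_dprime([0, 4]))
print("vertical meridian d':  ", meridian_dprime([2, 6]))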

Acknowledgements: This work was funded by NIH grant R01 EY029788-01 and NSF Award 1449828 (NRT-DESE: Graduate Training in Data-Enabled Research into Human Behavior and its Cognitive and Neural Mechanisms).

Talk 2, 8:30 am, 61.22

Understanding the invariances of visual features with separable subnetworks

Christopher Hamblin1, Talia Konkle1, George Alvarez1; 1Harvard University

Visual feature detectors that are useful for high-level semantic tasks must often be invariant to differences in the input space, but how such invariant feature detectors are constructed through image-computable operations is a fundamental and poorly understood challenge. Deep convolutional neural networks have the potential to provide insight into this puzzle, as invariant feature tuning often emerges in the latent spaces of such networks, but how? Here we present a novel pruning method we call 'feature splitting', which can split a single CNN feature into multiple sparse subnetworks, each of which only preserves its tuning response to a selection of inputs. We focus on polysemantic units, which respond strongly and selectively to seemingly unrelated semantic categories (e.g., monkey faces and written text), as a case study for splitting a feature across its invariance structure. While a few examples of polysemantic units have been characterized in DNNs, here we develop a data-driven method for identifying polysemantic units in the network. Then, we extract multiple sparse subnetworks, each of which only preserves the feature’s response to a targeted subset of image patches (e.g., to monkey faces, or to written text). In such instances, we find our feature-splitting algorithm returns highly separable subnetworks, with few shared weights between them. These findings indicate that the tuning of polysemantic units draws largely on highly distinct image filtering processes, acting as an ‘or’ gate by summing the outputs of these processes. Broadly, these feature-splitting methods introduce a principled approach for dissecting a wide range of invariance structures necessary for high-level feature detection (e.g., over units that respond to both profile and frontal views of faces, or to objects presented at different scales in the image), isolating the separable and shared computations underlying invariance.
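The general idea of splitting a feature across its invariance structure can be sketched as learning a sparse weight mask that preserves a single unit's response for one subset of images. The toy network, unit index, and loss settings below are assumptions for illustration, not the authors' implementation.

# Illustrative sketch only: learn a sparse mask over one layer's weights so that
# a single unit's response is preserved for a target subset of images.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy two-layer convolutional network standing in for a pretrained CNN (assumption).
w1 = torch.randn(16, 3, 3, 3)
w2 = torch.randn(32, 16, 3, 3)
unit = 5                                   # index of the "polysemantic" unit (assumed)
subset = torch.randn(8, 3, 64, 64)         # stand-in for e.g. monkey-face image patches

def unit_response(mask):
    # forward pass with the second-layer weights masked
    h = F.relu(F.conv2d(subset, w1, padding=1))
    h = F.relu(F.conv2d(h, w2 * mask, padding=1))
    return h.mean(dim=(2, 3))[:, unit]     # global-average-pooled unit activation

with torch.no_grad():
    target = unit_response(torch.ones_like(w2))   # original tuning to this subset

mask_logits = torch.zeros_like(w2, requires_grad=True)
opt = torch.optim.Adam([mask_logits], lr=0.05)
l1_weight = 1e-3                           # sparsity pressure (assumed value)

for step in range(300):
    mask = torch.sigmoid(mask_logits)
    loss = F.mse_loss(unit_response(mask), target) + l1_weight * mask.mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

kept = (torch.sigmoid(mask_logits) > 0.5).float().mean().item()
print(f"fraction of weights kept in the subnetwork: {kept:.2f}")

Running the same procedure on two different image subsets and comparing the resulting masks would show how separable the two subnetworks are.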

Talk 3, 8:45 am, 61.23

Low and high spatial frequencies contribute equally to rapid threat detection when contrast is normalized

Claudia Damiano1, Chrissy Engelen1, Johan Wagemans1; 1KU Leuven

Humans and non-human primates are able to detect threatening stimuli (e.g., snakes) faster than non-threatening stimuli. It is thought that this ability operates on low spatial frequency (LSF) information. However, a natural image has higher contrast at low spatial frequencies than at high spatial frequencies (HSF). This means that when an image is frequency filtered to retain either LSF or HSF information, the LSF images have higher contrast than the HSF images, making the LSF images more visible to the eye. Thus, it is unclear whether rapid threat detection truly relies on LSF information or simply on the greater visibility of the higher-contrast stimuli. Previous studies have failed to isolate the spatial frequency information completely by not contrast-normalizing the frequency-filtered stimuli. In the current study, we ran a rapid threat detection experiment (N = 39) using HSF and LSF versions of threatening (snakes, wasps) and non-threatening (salamanders, flies) animal images that were either contrast-normalized after filtering, or not contrast-normalized. Results revealed higher accuracy for distinguishing between threatening and non-threatening animals with LSF images (accuracy = 62.8%) compared to HSF images (49.0%; p < 0.001) when the images were not contrast-normalized. Critically, when the images were contrast-normalized after being frequency filtered, there was no difference in accuracy between LSF and HSF images (accuracy = 64.1% vs. 64.4%, p = 0.92). This work has important implications for the field of threat detection, since much of the work in this field is based on the idea that LSF information is what allows people to make rapid threat judgements and other emotional appraisals. Our findings call this idea into question and are an important reminder that, to make a true claim about the role of spatial frequency information, one must isolate spatial frequency from other low-level properties such as contrast and luminance.
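The stimulus manipulation at the core of this design, frequency filtering followed by contrast normalization, can be sketched as follows. The cutoff frequencies and target RMS contrast are placeholder values, not the study's exact parameters.

# Illustrative sketch only: frequency filter a grayscale image and equate
# RMS contrast afterwards, so LSF and HSF versions are matched in visibility.
import numpy as np

def frequency_filter(img, cutoff_cpi, keep="low"):
    # keep frequencies below (low-pass) or above (high-pass) the cutoff, in cycles/image
    f = np.fft.fftshift(np.fft.fft2(img - img.mean()))
    h, w = img.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    radius = np.hypot(yy, xx)                       # distance from DC in cycles/image
    mask = radius <= cutoff_cpi if keep == "low" else radius > cutoff_cpi
    filtered = np.real(np.fft.ifft2(np.fft.ifftshift(f * mask)))
    return filtered + img.mean()

def normalize_rms_contrast(img, target_rms=0.2, mean_lum=0.5):
    # rescale so the image has a fixed mean luminance and RMS contrast
    z = (img - img.mean()) / img.std()
    return np.clip(mean_lum + target_rms * z, 0, 1)

img = np.random.rand(256, 256)                      # stand-in for an animal image
lsf = normalize_rms_contrast(frequency_filter(img, 8, keep="low"))
hsf = normalize_rms_contrast(frequency_filter(img, 32, keep="high"))
print(lsf.std(), hsf.std())                         # approximately matched contrast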

Talk 4, 9:00 am, 61.24

Efficiently-generated object similarity scores predicted from human feature ratings and deep neural network activations

Martin N Hebart1, Philipp Kaniuth1, Jonas Perkuhn1; 1Max Planck Institute for Human Cognitive & Brain Sciences

A central aim in vision science is to elucidate the structure of human mental representations of objects. A key ingredient to this endeavor is the assessment of psychological similarities between objects. A challenge of current methods for exhaustively sampling similarities is their high resource demand, which grows non-linearly with the number of stimuli. To overcome this challenge, here we introduce an efficient method for generating similarity scores of real-world object images, using a combination of deep neural network activations and human feature ratings. Rather than directly predicting similarity for pairs of images, our method first predicts each image’s values on a set of 49 previously established representational dimensions (Hebart et al., 2020). Then, these values are used to generate similarities for arbitrary pairs of images. We evaluated the performance of this method using dimension predictions derived from the neural network architecture CLIP-ViT as well as direct human ratings of object dimensions collected through online crowdsourcing (n = 25 per dimension). Human ratings were collected on a set of 200 images, and generated similarity was evaluated on two separate sets of 48 images. CLIP-ViT performed very well at predicting global similarity for 1,854 objects (r = 0.89). Applied to three existing neuroimaging datasets, CLIP-ViT predictions rivaled and often even outperformed previously collected behavioral similarity datasets. For the two 48-image sets, both humans and CLIP-ViT provided good predictions of image dimension values across several datasets, leading to very good predictions of similarity (humans: R2 = 74-77%, CLIP-ViT: R2 = 76-82% explainable variance). Combining dimension predictions across humans and CLIP-ViT yielded a strong additional increase in performance (R2 = 84-87%). Together, our method offers a powerful and efficient approach for generating similarity judgments and opens up the possibility to extend research using image similarity to large stimulus sets.
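The described pipeline, predicting pairwise similarity from per-image dimension values, can be sketched as below. The embeddings are random placeholders, and the dot product is one plausible similarity function consistent with the dimensional framework of Hebart et al. (2020), not necessarily the exact function used here.

# Illustrative sketch only: generate pairwise object similarities from per-image
# values on a small set of representational dimensions.
import numpy as np

rng = np.random.default_rng(0)
n_images, n_dims = 48, 49
embeddings = rng.random((n_images, n_dims))     # stand-in for human ratings or CLIP-ViT predictions

similarity = embeddings @ embeddings.T          # dot-product similarity for all image pairs

# Compare predicted similarities with a (placeholder) observed similarity matrix
# using the upper triangle only, as is standard for representational matrices.
ground_truth = rng.random((n_images, n_images))
ground_truth = (ground_truth + ground_truth.T) / 2
iu = np.triu_indices(n_images, k=1)
r = np.corrcoef(similarity[iu], ground_truth[iu])[0, 1]
print(f"predicted vs. observed similarity: r = {r:.2f}, R^2 = {r**2:.2f}")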

Acknowledgements: This work was supported by a research group grant of the Max Planck Society awarded to M.N.H.

Talk 5, 9:15 am, 61.25

Human visual cortex as a texture basis set for object perception

Akshay Vivek Jagadeesh1,2, Justin L Gardner1,2; 1Department of Psychology, Stanford University, 2Wu Tsai Neurosciences Institute, Stanford University

Humans can easily and quickly identify objects, an ability thought to be supported by category-selective regions in lateral occipital cortex (LO) and ventral temporal cortex (VTC). However, prior evidence for this claim has not distinguished whether category-selective regions represent objects or simply represent complex visual features regardless of spatial arrangement, i.e., texture. If category-selective regions directly support object perception, one would expect human performance in discriminating objects from textures with scrambled object features to be predicted by the representational geometry of category-selective regions. To test this claim, we leveraged an image synthesis approach that provides independent control over the complexity and spatial arrangement of visual features. In a conventional categorization task, we indeed find that BOLD responses from category-selective regions predict human behavior. However, in a perceptual task where subjects discriminated real objects from synthesized textures containing scrambled features, visual cortical representations failed to predict human performance. Whereas human observers were highly sensitive in detecting the real object, visual cortical representations were insensitive to the spatial arrangement of features and were therefore unable to identify the real object amidst feature-matched textures. We find the same insensitivity to feature arrangement and inability to predict human performance in a model of macaque inferotemporal cortex and in ImageNet-trained deep convolutional neural networks. How then might these texture-like representations support object perception? We found that an image-specific linear transformation of visual cortical responses yielded a representation that was more selective for natural feature arrangement, demonstrating that the information necessary to support object perception is accessible, though it requires additional neural computation. Taken together, our results suggest that the role of visual cortex is not to explicitly encode a fixed set of objects but rather to provide a basis set of texture-like features that can be infinitely reconfigured to flexibly identify new object categories.
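One way to ask whether a representation can pick out the real object among feature-matched textures is an oddity analysis on its response patterns, sketched below. The "voxel" responses are random placeholders, and the correlation-based choice rule is an illustrative assumption, not the authors' exact analysis.

# Illustrative sketch only: can a representation identify the real object
# among feature-matched textures based on its representational distances?
import numpy as np

rng = np.random.default_rng(0)
n_voxels = 500
real = rng.standard_normal(n_voxels)             # response pattern to the real object
textures = rng.standard_normal((3, n_voxels))    # response patterns to scrambled textures

def oddity_choice(patterns):
    # pick the item whose pattern is least correlated with the others
    r = np.corrcoef(patterns)
    mean_sim = (r.sum(axis=1) - 1) / (len(patterns) - 1)   # exclude self-correlation
    return int(np.argmin(mean_sim))

trial = np.vstack([real, textures])              # item 0 is the real object
print("model picked the real object:", oddity_choice(trial) == 0)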

Talk 6, 9:30 am, 61.26

Categorization-dependent dynamic representation, selection and reduction of stimulus features in brain networks

Yaocong Duan1, Robin Ince1, Joachim Gross1,2, Philippe Schyns1; 1School of Psychology and Neuroscience, University of Glasgow, 2Institute for Biomagnetism and Biosignalanalysis, University of Muenster, Germany

A single image can afford multiple categorizations, each resulting from brain networks specifically processing the features relevant to each task. To understand where, when and how brain networks selectively process these features, our experiment comprised four different 2-Alternative-Forced-Choice (2-AFC) categorizations of the same 64 original images of a realistic city street containing varying embedded targets: a central face (male vs. female; happy vs. neutral), flanked on the left by a pedestrian (male vs. female) and on the right by a parked vehicle (car vs. SUV). Bubbles randomly sampled each image to generate 768 stimuli. In a within-participant design (N = 10), each participant performed the four tasks in four blocks on the same 768 stimuli, each repeated twice in random order. We concurrently recorded their categorization responses and source-localized MEG activity. We reconstructed the features each participant used in each task, computed as Mutual Information(Pixel visibility; Correct vs. Incorrect). We show (1) that each task incurs usage of task-specific features in each participant (e.g., body parts in the pedestrian-gender task vs. vehicle components in the vehicle task) and (2) that even the same categorization (e.g., pedestrian gender) incurs usage of different features across participants (e.g., upper vs. lower body parts). Critically, brain networks adaptively changed how the same MEG sources represented the same features depending on whether the task made them relevant or not, computed as Synergy(Feature visibility; MEG; Categorization tasks). When task-relevant, each feature is selected from occipital to higher cortical regions for categorization; when task-irrelevant, each is quickly reduced within occipital cortex [<170 ms]. Reconstructed network connectivity shows communication of only task-relevant features from a sending occipital cortex [50-100 ms] to a receiving right Fusiform Gyrus [100-130 ms]. All results are replicated in 10/10 participants to uniquely demonstrate where, when and how their brain networks dynamically select vs. reduce stimulus features to accomplish multiple categorization behaviors.
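A minimal sketch of the per-pixel Bubbles analysis, mutual information between pixel visibility and response correctness, is shown below. The masks, responses, and plug-in MI estimator are simulated placeholders rather than the authors' pipeline.

# Illustrative sketch only: mutual information between the visibility of each
# pixel (from the Bubbles masks) and response correctness across trials.
import numpy as np

rng = np.random.default_rng(0)
n_trials, h, w = 768, 64, 64
visible = rng.random((n_trials, h, w)) > 0.7     # binarized bubble masks per trial
correct = rng.random(n_trials) > 0.3             # correct / incorrect per trial

def binary_mi(x, y):
    # MI (bits) between two binary variables, plug-in estimate from joint counts
    joint = np.histogram2d(x, y, bins=2)[0] / len(x)
    px, py = joint.sum(axis=1, keepdims=True), joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

mi_map = np.array([[binary_mi(visible[:, i, j], correct) for j in range(w)]
                   for i in range(h)])
print("peak MI (bits):", mi_map.max())           # pixels most diagnostic of correctness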

Acknowledgements: P.G.S. received support from the Wellcome Trust (Senior Investigator Award, UK; 107802) and the MURI/Engineering and Physical Sciences Research Council (USA, UK; 172046-01). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Talk 7, 9:45 am, 61.27

Similarities and differences in the spatio-temporal neural dynamics underlying the recognition of natural images and line drawings

Johannes Singer1,2, Radoslaw Martin Cichy2, Martin N Hebart1; 1Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany, 2Free University Berlin, Germany

Humans effortlessly recognize line drawings of objects, indicating the robustness of our visual system to the abstraction of substantial amounts of visual information. Previous work has demonstrated that recognition of line drawings engages the same brain regions that support natural object recognition. Yet, it remains unknown whether the spatial and temporal representational dynamics of object recognition are similar for photos and drawings or, alternatively, whether distinct mechanisms are recruited for drawings, leading to a different representational structure across space and time. To address this question, we collected MEG (N=22) and fMRI (N=23) data while participants passively viewed the same object images depicted as either photographs, line drawings, or sketch-like drawings, with each type of depiction representing one level of visual abstraction. Using multivariate pattern analysis, we demonstrate that, regardless of the level of visual abstraction, information about the category of an object can be read out from MEG data rapidly after stimulus onset. For the fMRI data, we found significant above-chance decoding accuracies in overlapping parts of the occipital and ventral-temporal cortex for all types of depiction. In addition, object category information generalized strongly between types of depiction, beginning already in early visual processing and persisting in later processing stages. MEG-fMRI fusion based on representational similarity analysis revealed a largely similar spatio-temporal pattern for all types of depiction, first reaching early visual cortex and later high-level object-selective regions. Despite these similarities, photos showed overall stronger effects. Together, our findings reveal broad commonalities in the spatio-temporal representational dynamics of object recognition for natural images and drawings. These results constrain potential models of object recognition by demonstrating that the same mechanisms our brains use to recognize natural object images may also operate on object drawings, from the earliest processing stages.
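The cross-depiction generalization analysis can be sketched as a time-resolved classifier trained on MEG patterns for one type of depiction and tested on another. The simulated data, sensor and time dimensions, and classifier choice below are assumptions for illustration only.

# Illustrative sketch only: train on MEG patterns evoked by photographs and
# test on patterns evoked by drawings, separately at each time point.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_trials, n_sensors, n_times = 200, 64, 50
X_photo = rng.standard_normal((n_trials, n_sensors, n_times))     # simulated MEG (photos)
X_drawing = rng.standard_normal((n_trials, n_sensors, n_times))   # simulated MEG (drawings)
y = rng.integers(0, 2, n_trials)                                  # two object categories (toy case)

accuracy = np.empty(n_times)
for t in range(n_times):
    clf = LogisticRegression(max_iter=1000).fit(X_photo[:, :, t], y)
    accuracy[t] = clf.score(X_drawing[:, :, t], y)   # cross-depiction generalization

print("peak cross-depiction decoding accuracy:", accuracy.max())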

Acknowledgements: This work was supported by a Max Planck Research Group grant of the Max Planck Society awarded to MNH, German Research Council grants (CI241/1-1, CI241/3-1, CI241/7-1) awarded to RMC, and a European Research Council grant (ERC-StG-2018-803370) awarded to RMC.