VSS, May 13-18

Object Recognition: Models, reading

Talk Session: Monday, May 16, 2022, 8:15 – 9:45 am EDT, Talk Room 1
Moderator: Talia Konkle, Harvard University

Talk 1, 8:15 am, 41.11

Parallel word reading revealed by fixation-related potentials

Joshua Snell1, Jeremy Yeaton2, Jonathan Mirault3, Jonathan Grainger3; 1Vrije Universiteit Amsterdam, 2University of California Irvine, 3Aix Marseille University & CNRS

During reading, does lexical processing occur for multiple words simultaneously? Cognitive science has yet to answer this prominent question. Recently it has been argued (Snell & Grainger, 2019, TiCS) that the issue warrants supplementing the field’s traditional toolbox (eye-tracking) with neuroscientific techniques. Indeed, according to the OB1-reader model, upcoming words need not impact oculomotor behavior per se, but parallel processing of these words must nonetheless be reflected in neural activity patterns. Here we combined EEG with eye-tracking, time-locking the neural window of interest to the fixation on target words in sentence reading. During these fixations, we manipulated the identity of the subsequent word so that it formed either a syntactically legal or illegal continuation of the sentence. In line with previous research, oculomotor measures were unaffected. Yet, syntax impacted brain potentials as early as 350 ms after the target fixation onset. As prior EEG studies show that syntactic processing unfolds approximately 600 ms into viewing a word, the presently observed timings support the notion of parallel word processing. We reckon that OB1-reader is a particularly promising platform for theorizing about the reading brain.
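
As a rough illustration of the fixation-related-potential approach described above, the sketch below extracts EEG epochs time-locked to target-word fixation onsets and averages them; the sampling rate, array shapes, and epoch window are placeholder assumptions rather than details of the study.

```python
import numpy as np

# Assumed inputs (illustrative only): continuous EEG and fixation onsets from co-registered eye-tracking
fs = 1000                                      # sampling rate in Hz (assumption)
eeg = np.random.randn(64, 600 * fs)            # channels x samples, stand-in for recorded data
fix_onsets_s = np.array([12.3, 45.8, 90.1])    # target-word fixation onsets in seconds (placeholder)

tmin, tmax = -0.2, 0.8                         # epoch window around fixation onset, in seconds
n_pre, n_post = int(-tmin * fs), int(tmax * fs)

epochs = []
for t in fix_onsets_s:
    onset = int(round(t * fs))
    seg = eeg[:, onset - n_pre: onset + n_post]
    seg = seg - seg[:, :n_pre].mean(axis=1, keepdims=True)  # baseline-correct on the pre-fixation interval
    epochs.append(seg)

frp = np.mean(epochs, axis=0)                  # fixation-related potential: average over target fixations
```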

Acknowledgements: European Commission, grant H2020 MSCA-IF 833223

Talk 2, 8:30 am, 41.12

Connectivity constraints, viewing biases, and task demands within a bi-hemispheric interactive topographic network account for the layout of human ventral temporal cortex

Nicholas Blauch1, Marlene Behrmann1, David Plaut1; 1Carnegie Mellon University

Inferior temporal (IT) cortex of primates is topographically organized, with multiple large clusters of selectivity for different stimulus domains, including faces, bodies, and scenes, organized along a medial-lateral axis corresponding to the peripheral-foveal layout of earlier retinotopic cortex. In the homologous ventral temporal cortex (VTC) of humans, additional lateral word selectivity is seen, with a relative hemispheric left-lateralization that mirrors the relative right-lateralization of face selectivity. How does this topographic organization emerge, and what factors govern its consistent global layout? Recent computational modeling work using Interactive Topographic Networks has demonstrated that learning under biological constraints on the spatial cost and sign of connections within IT/VTC is sufficient to produce domain-selective clusters. Here, we test whether additionally constrained connectivity with early retinotopic areas and with downstream non-visual areas, in combination with domain-biased viewing conditions and task demands, produces the global layout of human VTC in a bi-hemispheric model. Retinotopic constraints are modeled by adding a spatial cost on feedforward connections from the polar-coordinate convolutional retinotopy of V4 into posterior VTC within each hemisphere of the model. Viewing conditions are modeled as distributions of relative image size and fixation likelihood, with realistic domain-specific parameters. Downstream language demands are modeled by an additional left-lateralized “language” system with connectivity restricted to model LH anterior VTC. Learning in the model accounts for 1) the retinotopically constrained layout of domain-selectivity for words, faces, objects, and scenes along a lateral-medial or foveal-peripheral axis, and 2) hemispheric organization in which words are relatively left lateralized and, due to competition with words, faces are relatively right lateralized. Our work instantiates the most complete computational model of human VTC topography to date, and paves the way for future work incorporating a dorsal stream, ventral-dorsal interactions, and more detailed downstream task demands.
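
A minimal sketch of the kind of spatial connection cost described above, penalizing long, strong connections between two topographic sheets so that learning favors local connectivity; the grid sizes, the L1 form of the penalty, and the regularization strength are illustrative assumptions, not the authors' implementation.

```python
import torch

def grid_coords(n):
    """(n*n, 2) coordinates of units laid out on an n x n cortical sheet."""
    ys, xs = torch.meshgrid(torch.arange(n), torch.arange(n), indexing="ij")
    return torch.stack([ys.flatten(), xs.flatten()], dim=1).float()

src, dst = grid_coords(16), grid_coords(16)            # e.g. a model V4 sheet projecting to posterior VTC
dists = torch.cdist(dst, src)                          # (n_dst, n_src) Euclidean distances between unit positions
W = torch.nn.Parameter(torch.randn(dst.shape[0], src.shape[0]) * 0.01)

def wiring_cost(W, dists, lam=1e-4):
    # Long connections with large weights are penalized most, encouraging spatially local connectivity.
    return lam * (W.abs() * dists).sum()

loss = wiring_cost(W, dists)                           # would be added to the task loss during training
loss.backward()
```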

Talk 3, 8:45 am, 41.13

Mechanisms of human dynamic visual perception revealed by sequential deep neural networks

Lynn K. A. Sörensen1, Sander M. Bohté2, Heleen A. Slagter3, H. Steven Scholte1; 1University of Amsterdam, 2Centrum Wiskunde & Informatica, 3Vrije Universiteit Amsterdam

Our visual world and its perception are dynamic. Rapid serial visual presentation (RSVP) — a task in which observers see rapid sequences of natural scenes — is an example of such dynamic sequential visual stimulation. Remarkably, humans are still able to recognise scenes when images are shown as briefly as 13 ms/image. This feat has been attributed to the computational power of the first feedforward sweep in sensory processing. In contrast, slower presentation durations (linked to better performance) have been suggested to increasingly engage recurrent processing. Yet, the computational mechanisms governing human sequential object recognition remain poorly understood. Here, we developed a class of deep learning models capable of sequential object recognition. Using these models, we compared different computational mechanisms: feedforward and recurrent processing, single and sequential image processing, as well as different forms of rapid sensory adaptation. We evaluated how these mechanisms perform on an RSVP task, and to what extent they explain human behavioural patterns (N=36) across varying presentation durations (13, 40, 80 ms/image). We found that only models that integrate images sequentially via lateral recurrence captured human performance levels across different presentation durations. Such sequential models also displayed a temporal correspondence to single-trial performance, with fewer model steps best explaining human behaviour at the fastest durations and more steps at slower ones. Importantly, this temporal correspondence was achieved without reducing the model’s overall explanatory power. Finally, augmenting this sequential model with a power-law adaptation mechanism was essential to provide a plausible account of how neural processing obtains informative representations based on the briefest visual stimulation. Taken together, these results shed new light on how local recurrence and adaptation jointly enable object recognition to be as fast and effective as required by a dynamic visual world.
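
As a sketch of the two mechanisms highlighted above, the toy layer below processes an RSVP stream sequentially, carries the previous time step forward through lateral recurrence, and suppresses responses with a power-law adaptation kernel; the single-layer structure, layer sizes, and the exponent are illustrative assumptions, not the authors' model.

```python
import torch
import torch.nn.functional as F

class RecurrentAdaptingLayer(torch.nn.Module):
    def __init__(self, channels=32, alpha=1.0):
        super().__init__()
        self.ff = torch.nn.Conv2d(3, channels, kernel_size=3, padding=1)              # feedforward drive
        self.lateral = torch.nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # lateral recurrence
        self.alpha = alpha

    def forward(self, image_sequence):
        # image_sequence: (T, B, 3, H, W) RSVP stream
        state, history, outputs = None, [], []
        for t, img in enumerate(image_sequence):
            drive = self.ff(img)
            if state is not None:
                drive = drive + self.lateral(state)              # integrate the previous time step laterally
            if history:
                # Power-law adaptation: suppression from responses k steps in the past decays as k^(-alpha)
                kernel = torch.tensor([(t - k) ** (-self.alpha) for k in range(t)])
                suppression = sum(w * h for w, h in zip(kernel, history))
                drive = drive - suppression
            state = F.relu(drive)
            history.append(state.detach())
            outputs.append(state)
        return outputs                                           # per-image responses for a downstream readout

layer = RecurrentAdaptingLayer()
stream = torch.randn(5, 1, 3, 32, 32)                            # five RSVP frames (placeholder data)
responses = layer(stream)
```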

Acknowledgements: This work was funded by a Research Talent Grant (406.17.554) from the Dutch Research Council (NWO) awarded to all authors.

Talk 4, 9:00 am, 41.14

A brain-inspired object-based attention network for multi-object recognition and visual reasoning

Hossein Adeli1, Seoyoung Ahn1, Gregory Zelinsky1,2; 1Department of Psychology, Stony Brook University, 2Department of Computer Science, Stony Brook University

To achieve behavioral goals, the visual system recognizes and processes the objects in a scene using a sequence of selective glimpses, but how is this attention control learned? Here we present an encoder-decoder model that is inspired by the interacting visual pathways making up the recognition-attention system in the brain. The encoder can be mapped onto the ventral ‘what’ pathway, which uses a hierarchy of modules and employs feedforward, recurrent, and capsule layers to obtain an object-centric hidden representation for classification. The object-centric capsule representation feeds to the dorsal ‘where’ pathway, where the evolving recurrent representation provides top-down attentional modulation to plan subsequent glimpses (analogous to fixations) to route different parts of the visual input for processing (with the encoding and decoding steps taken iteratively). We evaluate our model on multi-object recognition (highly overlapping digits, digits among distracting clutter) and visual reasoning tasks. Our model achieved 95% accuracy on classifying highly overlapping digits (80% overlap between bounding boxes) and significantly outperforms the Capsule Network model (<90%) trained on the same dataset while having a third as many parameters. Ablation studies show how recurrent, feedforward and glimpse mechanisms contribute to the model performance in this task. In a same-different task (from the Synthetic Visual Reasoning Tasks benchmark), our model achieved near-perfect accuracy (>99%), similar to ResNet and DenseNet models (outperforming AlexNet, VGG and CORnets) on comparing two randomly generated objects. On a challenging generalization task where the model is tested on stimuli that are different from the training set, our model achieved 82% accuracy, outperforming bigger ResNet models (71%), demonstrating the benefit of a contextualized recurrent computation paired with an object-centric attention mechanism glimpsing the objects. Our work takes a step towards more biologically plausible architectures by integrating recurrent object-centric representation with the planning of attentional glimpses.
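
A minimal sketch of the iterative encode-then-glimpse loop described above: a ventral (“what”) encoder classifies the attended input, and a dorsal (“where”) readout produces an attention map that gates the input for the next glimpse. The simple linear maps standing in for capsule and recurrent modules, and all layer sizes, are illustrative assumptions, not the authors' architecture.

```python
import torch

class GlimpseModel(torch.nn.Module):
    def __init__(self, img_size=28, hidden=64, n_classes=10):
        super().__init__()
        d = img_size * img_size
        self.encoder = torch.nn.Sequential(torch.nn.Linear(d, hidden), torch.nn.ReLU())  # "what" pathway
        self.classifier = torch.nn.Linear(hidden, n_classes)
        self.where = torch.nn.Linear(hidden, d)                                          # "where" pathway

    def forward(self, image, n_glimpses=3):
        B = image.shape[0]
        x = image.view(B, -1)
        attn = torch.ones_like(x)                      # first glimpse: no spatial bias
        logits = []
        for _ in range(n_glimpses):
            h = self.encoder(x * attn)                 # encode the attended (routed) portion of the input
            logits.append(self.classifier(h))          # classify the currently attended object
            attn = torch.sigmoid(self.where(h))        # top-down map selecting where to glimpse next
        return logits                                  # one prediction per glimpse

model = GlimpseModel()
preds = model(torch.rand(8, 1, 28, 28))                # e.g. overlapping-digit images (placeholder data)
```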

Talk 5, 9:15 am, 41.15

Neural and computational evidence that category-selective visual regions are facets of a unified object space

Jacob S. Prince1, Talia Konkle1; 1Harvard University

How do cortical regions with selectivity for faces, bodies, words, and scenes relate to one another, and to surrounding occipitotemporal cortex? A prominent theory is that they are independent and highly specialized regions for processing their preferred domains, but the existence of systematic structure in their responses to non-preferred categories challenges this strict domain-specific account. To probe the response structure of these regions at unprecedented scale, we use the newly released Natural Scenes Dataset, containing high-resolution fMRI responses to thousands of common object images. Considering 12 regions with selectivity for faces, bodies, word forms, and scenes, we find correlated representational geometry between all pairs of ROIs (mean r=0.46 over independent runs, sd=0.17), even for regions with anticorrelated univariate response profiles (e.g. FFA-1 vs. PPA, univariate r=-0.37, RDM r=0.43). These similar representational geometries suggest a shared representational goal unifying these regions, where univariate selectivity profiles highlight different discriminative feature axes of an integrated representational space for objects. Deep neural networks trained on multi-way object recognition directly instantiate this theory, as they operationalize a rich discriminative object space, without any specialized mechanisms for particular domains. We find that (i) there are naturally emerging subsets of model units with selectivity for each of these domains; (ii) by re-weighting these selective units, we can predict both univariate and multivariate response structure in the corresponding category-selective regions, in some cases approaching the inter-subject noise ceiling (average max-layer univariate predictivity: r=0.46 in 515 held-out images; average RDM predictivity: r=0.39 in >10e5 pairwise comparisons); and (iii) the unified representational space of the whole layer, considering all units, can predict responses in the macro-scale OTC sector (univariate max r=0.34, RDM r=0.60). These converging results offer strong empirical support at scale for the emerging theoretical view that category-selective regions are facets of a unified map of object space along occipitotemporal cortex.
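
A minimal sketch of comparing representational geometry across two category-selective ROIs, in the spirit of the analysis above: build a representational dissimilarity matrix (RDM) per region and correlate them alongside the univariate profiles. The random response matrices, the correlation-distance RDM, and the use of Pearson r to compare RDMs are assumptions for illustration only.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

n_images = 515
ffa = np.random.randn(n_images, 300)    # images x voxels, stand-in for FFA-1 responses
ppa = np.random.randn(n_images, 400)    # images x voxels, stand-in for PPA responses

def rdm(responses):
    """Lower-triangle dissimilarity vector: 1 - Pearson r between image response patterns."""
    return pdist(responses, metric="correlation")

# Univariate profiles can anticorrelate even when multivariate geometries agree
univariate_r, _ = pearsonr(ffa.mean(axis=1), ppa.mean(axis=1))
rdm_r, _ = pearsonr(rdm(ffa), rdm(ppa))
print(f"univariate r = {univariate_r:.2f}, RDM r = {rdm_r:.2f}")
```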

Acknowledgements: This research was supported by NSF CAREER BCS-1942438.

Talk 6, 9:30 am, 41.16

Intuiting machine failures

Makaela Nartker1, Zhenglong Zhou2, Chaz Firestone1; 1Johns Hopkins University, 2University of Pennsylvania

A key ingredient of effective collaborations is knowing the strengths and weaknesses of one’s collaborators. But what if one’s collaborator is a machine-classification system, of the sort increasingly appearing in semi-autonomous vehicles, radiologists’ offices, and other contexts in which visual processing is automated? Such systems have both remarkable strengths (including superhuman performance at certain classification tasks) and striking weaknesses (including susceptibility to bizarre misclassifications); can naive subjects intuit when such systems will succeed or fail? Here, five experiments (N=900) investigate whether humans can anticipate when machines will misclassify natural images. E1 showed subjects two natural images: one which reliably elicits misclassifications from multiple state-of-the-art Convolutional Neural Networks, and another which reliably elicits correct classifications. We found that subjects could predict which image was misclassified on a majority of trials. E2 and E3 showed that subjects are sensitive to the nature of such misclassifications; subjects’ performance was better when they were told what the misclassification was (but not which image received it), and worse when the label shown was from another, randomly chosen category. Crucially, in E4, we asked subjects to either (a) choose which image they thought was misclassified, or (b) choose the image that is the worst example of its category. While both instructions resulted in subjects choosing misclassified images above chance, subjects who were instructed to identify misclassifications performed better. In other words, humans appreciate the cues that mislead machines, beyond simply considering the prototypicality of an image. Lastly, E5 explored more naturalistic settings. Here, instead of an NAFC design, subjects identified potential misclassifications from a stream of individual images; even in this setting, subjects were remarkably successful in anticipating machine errors. Thus, humans can anticipate when and how their machine collaborators succeed and fail, a skill that may be of great value to human-machine teams.