Organizers: Dirk B. Walther1, James Elder2; 1University of Toronto, 2York University
Presenters: James Elder, Thomas Serre, Anitha Pasupathy, Mary A. Peterson, Pieter Roelfsema, Dirk B. Walther
A principal challenge for both biological and machine vision systems is to integrate and organize the diversity of cues received from the environment into the coherent global representations we experience and require to make good decisions and take effective actions. Early psychological investigations date back more than 100 years to the seminal work of the Gestalt school, but in the last 50 years, neuroscientific and computational approaches to understanding perceptual organization have become equally important, and a full understanding requires integration of all three approaches. We understand perceptual organization as the process of establishing meaningful relational structures over raw visual data, where the extracted relations correspond to the physical structure and semantics of the scene. The relational structure may be simple, e.g., set membership for image segmentation, or more complex: sequence representations of contours, hierarchical representations of surfaces, layered representations of scenes, etc. These representations support higher-level visual tasks such as object detection, object recognition, activity recognition, and 3D scene understanding. This symposium will review the current state of perceptual organization research from neuroscientific, psychophysical, and computational perspectives and highlight outstanding open questions.

Current feedforward computational models of object perception fail to account for the holistic nature of human object perception. A computational analysis of perceptual grouping problems leads to an alternative account that refines feedforward representations of local features with recurrent computations implementing global optimization objectives (James Elder). These principles can be seen in the recurrent computations leading to the formation of extra-classical receptive fields in early visual cortex.
New neural network models of these recurrent circuits lead to emergent grouping principles of proximity and good continuation and demonstrate how recurrence leads to better contour detection and a more accurate account of human contour processing (Thomas Serre). These early contour representations are further integrated in mid-level stages of the ventral visual pathway to form object representations. A key challenge for perceptual organization is to accurately encode object shape despite occlusion and clutter. Behavioural and physiological results reveal that the visual system relies upon a competitive recurrent grouping-by-similarity computation to protect object encoding from the effects of crowding (Anitha Pasupathy). This kind of competitive computation also appears to be at the heart of figure/ground assignment, where convexity serves as a figural prior (Mary Peterson). While simple grouping operations may be achieved through a feedforward process, it will be argued that more complex grouping operations are invoked through an incremental, attentive process that manifests as a gradual spread of activation across visual cortex (Pieter Roelfsema). To close, we show that local parallelism of contours leads to improved scene categorization as well as clearer representations of natural scenes in the human visual cortex (Dirk B. Walther). Through these closely related talks, the symposium will illustrate how the integration of physiological, psychophysical, and computational research has led to a better understanding of perceptual organization, highlight key open research questions, and suggest directions for integrative research to answer them.
The role of local and holistic processes in the perceptual organization of object shape
James Elder1; 1York University
Perceptual grouping is the problem of determining which features go together and in what configuration. Since this is a computationally hard problem, it is important to ask whether object perception really depends on perceptual grouping. For example, under ideal conditions, a collection of local features may be sufficient to classify an object. These features could be computed via a feedforward process, obviating the need for perceptual grouping. Indeed, this fast feedforward 'bag of features' conception of object processing is prevalent in both human and computer vision research. Here I will review psychophysical and computational research that challenges the ability of this class of model to explain object perception. Psychophysical assessment shows that humans are largely unable to pool local shape features to make object judgements unless these features are configured holistically. Further, the formation of these perceptual groups is itself found to rely on holistic shape representations, pointing to a recurrent circuit that conditions local grouping computations on this holistic encoding. While feedforward deep learning models for object classification are more powerful than earlier bag-of-features models, we find that these models also fail to capture human sensitivity to holistic shape and perceptual robustness to occlusion. This leads to the hypothesis that a computational model designed to solve perceptual grouping tasks as well as object classification will form a better account of human object perception, and I will highlight how optimal solutions to these grouping tasks are typically based on a fusion of feedforward local computations with holistic optimization and feedback.
Recurrent neural circuits for perceptual grouping
Thomas Serre1; 1Brown University
Neurons in the visual cortex are sensitive to context: responses to stimuli presented within their classical receptive fields (CRFs) are modulated by stimuli in their surrounding extra-classical receptive fields (eCRFs). However, the circuits underlying these contextual effects are not well understood, and little is known about how these circuits drive perception during everyday vision. We tackle these questions by approximating circuit-level eCRF models with a differentiable discrete-time recurrent neural network that is trainable with gradient descent. After optimizing model synaptic connectivity and dynamics for object contour detection in natural images, the neural-circuit model rivals human observers on the task with far better sample efficiency than state-of-the-art computer vision approaches. Notably, the model also exhibits CRF and eCRF phenomena typically associated with primate vision. Indeed, the model's ability to accurately detect object contours critically depends on these contextual effects, which are absent in ablated versions of the model. Finally, we derive testable predictions about the neural mechanisms responsible for contextual integration and illustrate their importance for accurate and efficient perceptual grouping.
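The flavor of such a contextual circuit can be caricatured in a few lines. The sketch below is a minimal illustration, not the trained model from the talk: a one-dimensional population whose classical-receptive-field drive is modulated by a facilitatory (additive) and a suppressive (multiplicative) term pooled from the surround. The kernel, weights, and dynamics are all illustrative assumptions.

```python
import numpy as np

def ecrf_step(h, drive, w_exc=0.4, w_inh=0.8, dt=0.1):
    """One discrete-time update of a toy contextual circuit.

    The surround pool stands in for eCRF connectivity; the additive term is
    facilitatory, the multiplicative term suppressive. All weights are
    illustrative assumptions, not fitted values.
    """
    pooled = np.convolve(h, np.ones(3) / 3.0, mode="same")
    dh = -h + np.maximum(0.0, drive + w_exc * pooled - w_inh * pooled * h)
    return h + dt * dh

def run_circuit(drive, steps=200):
    """Iterate the recurrent dynamics toward steady state."""
    h = np.zeros_like(drive)
    for _ in range(steps):
        h = ecrf_step(h, drive)
    return h

# A stimulus alone vs. the same stimulus with like stimuli in the surround:
isolated = np.zeros(11)
isolated[5] = 1.0
context = np.zeros(11)
context[4:7] = 1.0
r_iso = run_circuit(isolated)[5]
r_ctx = run_circuit(context)[5]  # surround suppression: r_ctx < r_iso
```

Even this toy circuit reproduces one signature eCRF phenomenon, surround suppression: the steady-state response to the central stimulus is weaker when like stimuli fill its surround.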
Encoding occluded and crowded scenes in the monkey brain: object saliency trumps pooling
Anitha Pasupathy1; 1University of Washington
I will present results from a series of experiments investigating how simple scenes with crowding and partial occlusion are encoded in mid-level stages of the ventral visual pathway in the macaque monkey. Past studies have demonstrated that neurons in area V4 encode the shape of isolated visual stimuli. When these stimuli are surrounded by distractors that crowd and occlude, the shape selectivity of V4 neurons degrades, consistent with the decline in the animal’s ability to discriminate target object shapes. To rigorously test whether this is due to the encoding of “pooled” summary statistics of the image within the RF, we characterized responses and selectivity for a variety of target-distractor relationships. We find that the pooling model is a reasonable approximation of neuronal responses when targets and distractors are either all similar or all different. But when the distractors are all similar to one another and can be perceptually grouped, the target becomes salient by contrast. This saliency is reflected in neuronal responses and in the animal’s behavior, both of which become more resistant to crowding and occlusion. Thus, target saliency in terms of featural contrast trumps pooled encoding. These results are consistent with a normalization model in which target saliency titrates the relative influence of different stimuli in the normalization pool.
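The proposed normalization account can be illustrated with a toy computation. This is a hedged sketch, not the fitted model from the experiments: the `target_saliency` parameter and the linear way it scales the distractors' weight in the normalization pool are illustrative assumptions.

```python
import numpy as np

def normalized_response(target_drive, distractor_drives, target_saliency, sigma=0.1):
    """Divisive normalization of a V4-like target response.

    `target_saliency` (0..1) titrates how strongly the distractors enter the
    normalization pool; this linear weighting is an illustrative assumption.
    """
    w_distractor = 1.0 - target_saliency
    pool = target_drive + w_distractor * np.sum(distractor_drives)
    return target_drive / (sigma + pool)

distractors = np.array([0.8, 0.8, 0.8])
# Homogeneous distractors group together, so the target pops out (high saliency):
r_salient = normalized_response(1.0, distractors, target_saliency=0.9)
# Heterogeneous display: low target saliency, distractors crowd the response:
r_crowded = normalized_response(1.0, distractors, target_saliency=0.1)
```

When the target is salient, the distractors contribute little to the normalization pool, so the target response is protected; when saliency is low, the same distractors suppress it, mimicking crowding.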
Inhibitory competition in figure assignment: insights from brain and behavior
Mary A. Peterson1; 1University of Arizona
Behavioral and neural evidence indicates that the organization of the visual field into figures (i.e., objects) and their local grounds is not a simple, early stage of processing, as traditional theories supposed. Instead, figure/object detection entails competition between the different interpretations that might be seen. In the first part of my talk, I will discuss behavioral evidence that multiple interpretations compete in the classic demonstration that convexity is a figural prior. In the second part, I will present neural evidence of suppression in the BOLD response to the groundside of objects when a portion of a familiar configuration is suggested there but loses the competition for perception. These results begin to elucidate the complex interactions between local and global, high- and low-level factors involved in perceptually organizing the visual field into objects and backgrounds.
The neuronal mechanisms for object-based attention and how they solve the binding problem
Pieter Roelfsema1,2,3; 1Netherlands Institute for Neuroscience, 2Vrije Universiteit Amsterdam, 3Amsterdam University Medical Center
Our visual system groups the image elements of objects and segregates them from other objects and the background. I will discuss the neuronal mechanisms for these grouping operations, proposing that there are two processes for perceptual grouping. The first is ‘base grouping’, a process that relies on neurons tuned to feature conjunctions and occurs in parallel across the visual scene. If there are no neurons tuned to the required feature conjunctions, a second process, called ‘incremental grouping’, comes into play. Incremental grouping is a time-consuming and capacity-limited process that relies on the gradual spread of enhanced neuronal activity across the distributed representation of an object in the visual cortex during a delayed phase of the neuronal responses. Incremental grouping can occur for only one object at a time. The spread of enhanced activity corresponds to the spread of object-based attention at the psychological level of description. Hence, the binding problem is solved by labelling the representations of image elements in the visual cortex with enhanced activity; we did not obtain any evidence for a role of neuronal synchronization. Inhibition of the late-phase activity in primary visual cortex completely blocked figure-ground perception, demonstrating a causal link between enhanced neuronal activity and perceptual organization. These neuronal mechanisms for perceptual grouping account for many of the perceptual demonstrations of the Gestalt psychologists.
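Incremental grouping as a gradual spread of enhancement over one object's representation can be caricatured as label propagation over connected image elements. The sketch below is an algorithmic analogy, not the neuronal model: a breadth-first spread from an attended seed that enhances only the elements connected to it, with more distant elements reached in later propagation steps, echoing the time-consuming, one-object-at-a-time character of the process.

```python
from collections import deque

def incremental_grouping(grid, seed):
    """Spread an attentional enhancement label from `seed` over connected
    image elements (4-connectivity), one frontier at a time.

    Elements farther along the object are enhanced later, and only the
    seeded object is labelled -- one object at a time.
    """
    rows, cols = len(grid), len(grid[0])
    enhanced = {seed}
    frontier = deque([seed])
    while frontier:
        r, c = frontier.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 1 and (nr, nc) not in enhanced):
                enhanced.add((nr, nc))
                frontier.append((nr, nc))
    return enhanced

# Two separate contours; only the one containing the seed becomes enhanced:
grid = [
    [1, 1, 1, 0, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1],
]
group = incremental_grouping(grid, (0, 0))
```

The second, unseeded contour stays unlabelled, just as unattended objects do not receive the enhanced-activity label.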
Neural correlates of local parallelism during naturalistic vision
Dirk B. Walther1; 1University of Toronto
Human observers can rapidly perceive complex real-world scenes. Grouping visual elements into meaningful units is an integral part of this process. Yet, so far, the neural underpinnings of perceptual grouping have only been studied with simple laboratory stimuli. Here we uncover the neural mechanisms of one important perceptual grouping cue, local parallelism. Using a new, image-computable algorithm for detecting local parallelism in line drawings and photographs, we manipulated the local parallelism content of real-world scenes. We decoded scene categories from patterns of brain activity obtained via functional magnetic resonance imaging (fMRI) in 38 human observers while they viewed the manipulated scenes. Decoding was significantly more accurate for scenes containing strong compared to weak local parallelism in the parahippocampal place area (PPA), indicating a central role of parallelism in scene perception. To investigate the origin of the parallelism signal, we performed a model-based fMRI analysis of the public BOLD5000 dataset, looking for voxels whose activation time course matches that of the locally parallel content of the 4,916 photographs viewed by the participants in the experiment. We found a strong relationship with average local parallelism in visual areas V1-V4, PPA, and retrosplenial cortex (RSC). Notably, the parallelism-related signal peaked first in V4, suggesting V4 as the site for extracting parallelism from the visual input. We conclude that local parallelism is a perceptual grouping cue that influences neuronal activity throughout the visual hierarchy, presumably starting in V4. Parallelism plays a key role in the representation of scene categories in the PPA.
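Multivoxel decoding of the kind described above can be sketched with a simple nearest-centroid correlation classifier on synthetic "voxel" patterns. The study's actual decoder and preprocessing are not specified here, so every detail below (the decoding rule, category templates, noise level, and category names) is an illustrative assumption.

```python
import numpy as np

def correlation_decoder(train_patterns, train_labels, test_pattern):
    """Assign `test_pattern` to the category whose mean training pattern it
    correlates with most strongly (nearest-centroid correlation decoding)."""
    best_label, best_r = None, -np.inf
    for label in sorted(set(train_labels)):
        centroid = np.mean(
            [p for p, l in zip(train_patterns, train_labels) if l == label],
            axis=0)
        r = np.corrcoef(centroid, test_pattern)[0, 1]
        if r > best_r:
            best_label, best_r = label, r
    return best_label

# Synthetic "voxel" patterns: each category is a template plus noise
# (templates, noise level, and category names are illustrative assumptions).
rng = np.random.default_rng(0)
t_beach, t_forest = rng.normal(size=50), rng.normal(size=50)
patterns = [t_beach + 0.3 * rng.normal(size=50) for _ in range(10)]
patterns += [t_forest + 0.3 * rng.normal(size=50) for _ in range(10)]
labels = ["beach"] * 10 + ["forest"] * 10
test = t_beach + 0.3 * rng.normal(size=50)
pred = correlation_decoder(patterns, labels, test)
```

In this framing, stronger local parallelism would correspond to more reliable, less noisy category patterns, and hence to higher decoding accuracy.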