Feedforward & Recurrent Streams in Visual Perception

Organizers: Shaul Hochstein¹, Merav Ahissar²; ¹Life Sciences, Hebrew University, Jerusalem, ²Psychology, Hebrew University, Jerusalem
Presenters: Jeremy M Wolfe, Shaul Hochstein, Catherine Tallon-Baudry, James DiCarlo, Merav Ahissar

< Back to 2021 Symposia

Forty years ago, Anne Treisman presented Feature Integration Theory (FIT; Treisman & Gelade, 1980). FIT proposed a parallel, preattentive first stage and a serial second stage controlled by visual selective attention, so that search tasks could be divided into those performed by the first stage, in parallel, and those requiring serial processing and further “binding” in an object file (Kahneman, Treisman, & Gibbs, 1992). Ten years later, Jeremy Wolfe expanded FIT with Guided Search Theory (GST), suggesting that information from the first stage could guide selective attention in the second (Wolfe, Cave & Franzel, 1989; Wolfe, 1994). His lab’s recent visual search studies enhanced this theory (Wolfe, 2007), including studies of factors governing search (Wolfe & Horowitz, 2017), hybrid search (Wolfe, 2012; Nordfang, Wolfe, 2018), and scene comprehension capacity (Wick … Wolfe, 2019). Another ten years later, Shaul Hochstein and Merav Ahissar proposed Reverse Hierarchy Theory (RHT; Hochstein, Ahissar, 2002), turning FIT on its head by suggesting that early conscious gist perception, like early generalized perceptual learning (Ahissar, Hochstein, 1997, 2004), reflects high-level cortical representations. Later feedback, returning to lower levels, allows for conscious perception of scene details, already represented in earlier areas. Feedback also enables detail-specific learning. Follow-up studies found that the primacy of top-level gist perception leads to the counter-intuitive results that faces pop out of heterogeneous object displays (Hershler, Hochstein, 2005), individuals with neglect syndrome are better at global tasks (Pavlovskaya … Hochstein, 2015), and gist perception includes ensemble statistics (Khayat, Hochstein, 2018, 2019; Hochstein et al., 2018).
Ahissar’s lab mapped RHT dynamics to auditory systems (Ahissar, 2007; Ahissar et al., 2008) in both perception and skill acquisition, whether successful or failed owing to developmental disabilities (Lieder … Ahissar, 2019).
James DiCarlo has been pivotal in contrasting feedforward-only with recurrence-integrating network models of extrastriate cortex, considering animal and human behavior (DiCarlo, Zoccolan, Rust, 2012; Yamins … DiCarlo, 2014; Yamins, DiCarlo, 2016). His large-scale electrophysiology recordings from the ventral stream of behaving primates performing challenging object-recognition tasks relate directly to whether recurrent connections are critical or superfluous (Kar … DiCarlo, 2019). He recently combined deep artificial neural network modeling, synthesized image presentation, and electrophysiological recording to control the neural activity of specific neurons and circuits (Bashivan, Kar, DiCarlo, 2019).
Catherine Tallon-Baudry uses MEG/EEG recordings to study neural correlates of conscious perception (Tallon-Baudry, 2012). She studied the roles of human brain oscillatory activity in object representation and visual search tasks (Tallon-Baudry, 2009), analyzing effects of attention and awareness (Wyart, Tallon-Baudry, 2009). She has directly tested, with behavior and MEG recording, implications of hierarchy and reverse hierarchy theories, including global information processing being first and mandatory in conscious perception (Campana, Tallon-Baudry, 2013; Campana … Tallon-Baudry, 2016).
In summary, bottom-up versus top-down processing theories reflect on the essence of perception: the dichotomy of rapid vision-at-a-glance versus slower vision-with-scrutiny, roles of attention, hierarchy of visual representation levels, roles of feedback connections, sites and mechanisms of various visual phenomena, and sources of perceptual/cognitive deficits (neglect, dyslexia, ASD).
Speakers at the proposed symposium will address these issues with both a historical and a forward-looking perspective.

Presentations

Is Guided Search 6.0 compatible with Reverse Hierarchy Theory?

Jeremy M Wolfe¹; ¹Harvard Medical School and Visual Attention Lab, Brigham & Women’s Hospital

It has been 30 years since the first version of the Guided Search (GS) model of visual search was published. As new data about search accumulated, GS needed modification. The latest version is GS6. GS argues that visual processing is capacity-limited and that attention is needed to “bind” features together into recognizable objects. The core idea of GS is that the deployment of attention is not random but is “guided” from object to object. For example, in a search for your black shoe, search would be guided toward black items. Earlier versions of GS focused on top-down (user-driven) and bottom-up (salience) guidance by basic features like color. Subsequent research adds guidance by history of search (e.g. priming), value of the target, and, most importantly, scene structure and meaning. Your search for the shoe will be guided by your understanding of the scene, including some sophisticated information about scene structure and meaning that is available “preattentively”. In acknowledging the initial, preattentive availability of something more than simple features, GS6 moves closer to ideas that are central to the Reverse Hierarchy Theory of Hochstein and Ahissar. As is so often true in our field, this is another instance where the answer is not Theory A or Theory B, even when they seem diametrically opposed. The next theory tends to borrow and synthesize good ideas from both predecessors.

Gist perception precedes awareness of details in various tasks and populations

Shaul Hochstein¹; ¹Life Sciences, Hebrew University, Jerusalem

Reverse Hierarchy Theory makes several dramatic propositions regarding conscious visual perception. These include the suggestion that, while the visual system receives scene details and builds from them representations of the objects, layout, and structure of the scene, the first conscious percept is nevertheless that of the gist of the scene – the result of implicit bottom-up processing. Only later does conscious perception attain scene details by returning to lower cortical area representations. Recent studies at our lab analyzed phenomena whereby participants receive and perceive the gist of the scene before, and without need for, consciously knowing the details from which the gist is constructed. One striking conclusion is that “pop-out” is an early high-level effect, and is therefore not restricted to basic element features. Thus, faces pop out from heterogeneous objects, and participants are unaware of rejected objects. Our recent studies of ensemble statistics perception find that computing a set mean does not require knowledge of the set’s individual members. This mathematically improbable computation is both useful and natural for neural networks. I shall discuss just how and why set means are computed without need for explicit representation of individuals. Interestingly, our studies of neglect patients find that their deficit lies in tasks requiring focused attention to local details, and not in those requiring only global perception. Neglect patients are quite good at pop-out detection and include left-side elements in ensemble perception.

From global to local in conscious vision: behavior & MEG

Catherine Tallon-Baudry¹; ¹CNRS Cognitive Neuroscience, Ecole Normale Supérieure, Paris

The reverse hierarchy theory makes strong predictions about conscious vision. Local details would be processed in early visual areas before being rapidly and automatically combined into global information in higher-order areas, where conscious percepts would initially emerge. The theory thus predicts that consciousness arises initially in higher-order visual areas, independently of attention and task, and that additional and optional attentional processes operating from top to bottom are needed to retrieve local details. We designed novel textured stimuli that, as opposed to Navon’s letters, are truly hierarchical. Taking advantage of both behavioral measures and the decoding of MEG data, we show that global information is consciously perceived faster than local details, and that global information is computed regardless of task demands during early visual processing. These results support the idea that global dominance in conscious percepts originates in the hierarchical organization of the visual system. Implications for the nature of conscious visual experience and its underlying neural mechanisms will be discussed.

Next-generation models of recurrent computations in the ventral visual stream

James DiCarlo¹; ¹Neuroscience, McGovern Inst. & Brain & Cognitive Sci., MIT

Understanding mechanisms underlying visual intelligence requires combined efforts of brain and cognitive scientists, and forward engineering emulating intelligent behavior (“AI engineering”). This “reverse-engineering” approach has produced more accurate models of vision. Specifically, a family of deep artificial neural-network (ANN) architectures arose from biology’s neural network for object vision — the ventral visual stream. Engineering advances applied to this ANN family produced specific ANNs whose internal in silico “neurons” are surprisingly accurate models of individual ventral stream neurons, and that now underlie artificial vision technologies. We and others have recently demonstrated a new use for these models in brain science — their ability to design patterns of light energy images on the retina that control neuronal activity deep in the brain. The reverse engineering iteration loop — respectable ANN models to new ventral stream data to even better ANN models — is accelerating. My talk will discuss this loop: experimental benchmarks for in silico ventral streams, key deviations from the biological ventral stream revealed by those benchmarks, and newer in silico ventral streams that partly close those differences. Recent experimental benchmarks argue that automatically evoked recurrent processing is critically important to even the first 300 msec of visual processing, implying that conceptually simpler, feedforward-only ANN models are no longer tenable as accurate in silico ventral streams. Our broader aim is to nurture and incentivize next-generation models of the ventral stream via a community software platform termed “Brain-Score”, with the goal of producing progress that individual research groups may be unable to achieve.

Visual and non-visual skill acquisition – success and failure

Merav Ahissar¹; ¹Psychology Department, Social Sciences & ELSC, Hebrew University, Israel

Acquiring expert skills requires years of experience – whether these skills are visual (e.g. face identification), motor (playing tennis) or cognitive (mastering chess). In 1977, Shiffrin & Schneider proposed an influential stimulus-driven, bottom-up theory of expertise automaticity, involving mapping stimuli to their consistent response. Integrating many studies since, I propose a general, top-down theory of skill acquisition. Novice performance is based on the high-level multiple-demand (Duncan, 2010) fronto-parietal system, and with practice, specific experiences are gradually represented in lower-level domain-specific temporal regions. This gradual process of learning-induced reverse hierarchies is enabled by detection and integration of task-relevant regularities. Top-down driven learning allows formation of task-relevant mappings and representations. These in turn form a space which affords task-consistent interpolations (e.g. letters in a manner crucial for letter identification rather than visual similarity). These dynamics characterize successful skills. Some populations, however, have reduced sensitivity to task-related regularities, hindering related skill acquisition and preventing specific expertise acquisition even after massive training. I propose that skill-acquisition failure, perceptual as well as cognitive, reflects specific difficulties in detecting and integrating task-relevant regularities, impeding formation of temporal-area expertise. Such is the case for individuals with dyslexia (reduced retention of temporal regularities; Jaffe-Dax et al., 2017), who fail to form an expert visual word-form area, and for individuals with autism (who integrate regularities too slowly for online updating; Lieder et al., 2019). Based on this general conceptualization, I further propose that this systematic impediment.
