Representation in the Visual System by Summary Statistics

Friday, May 7, 3:30 – 5:30 pm
Royal Ballroom 1-3

Organizer: Ruth Rosenholtz, MIT Department of Brain & Cognitive Sciences

Presenters: Ruth Rosenholtz (MIT Department of Brain & Cognitive Sciences), Josh Solomon (City University London), George Alvarez (Harvard University, Department of Psychology), Jeremy Freeman (Center for Neural Science, New York University), Aude Oliva (Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology), Ben Balas (MIT, Department of Brain and Cognitive Sciences)

Symposium Description

What is the representation in early vision?  Considerable research has demonstrated that the representation is not equally faithful throughout the visual field; representation appears to be coarser in peripheral and unattended vision, perhaps as a strategy for dealing with an information bottleneck in visual processing.  In the last few years, a convergence of evidence has suggested that in peripheral and unattended regions, the information available consists of summary statistics.  “Summary statistics” is a general term used to represent a class of measurements made by pooling over visual features of various levels of complexity, e.g. 1st order statistics such as mean orientation; joint statistics of responses of V1-like oriented feature detectors; or ensemble statistics that represent spatial layout information.  Depending upon the complexity of the computed statistics, many attributes of a pattern may be perceived, yet precise location and configuration information is lost in favor of the statistical summary.

This proposed representation for early vision is related to suggestions that the brain can compute summary statistics when such statistics are useful for a given task, e.g. texture segmentation, or explicit judgments of mean size of a number of items.  However, summary statistic models of early visual representation additionally suggest that under certain circumstances summary statistics are what the visual system is “stuck with,” even if more information would be useful for a given task.

This symposium will cover a range of related topics and methodologies.  Talks by Rosenholtz, Solomon, and Alvarez will examine evidence for a statistical representation in vision, and explore the capabilities of the system, using both behavioral experiments and computational modeling.  Freeman will discuss where summary statistics might be computed in the brain, based upon a combination of physiological findings, fMRI, and behavioral experiments.  Finally, we note that a summary statistic representation captures a great deal of important information, yet is ultimately lossy.  Such a representation in peripheral and/or unattended vision has profound implications for visual perception in general, from peripheral recognition through visual awareness and visual cognition.  Rosenholtz, Oliva, and Balas will discuss implications for a diverse set of tasks, including peripheral recognition, visual search, visual illusions, scene perception, and visual cognition.  The power of this new way of thinking about vision becomes apparent precisely from implications for a wide variety of visual tasks, and from evidence from diverse methodologies.


The Visual System as Statistician: Statistical Representation in Early Vision

Ruth Rosenholtz, MIT Department of Brain & Cognitive Sciences; B. J. Balas, Dept. of Brain & Cognitive Sciences, MIT; Alvin Raj, Computer Science and AI Lab, MIT; Lisa Nakano, Stanford; Livia Ilie, MIT

We are unable to process all of our visual input with equal fidelity.  At any given moment, our visual systems seem to represent the item we are looking at fairly faithfully.  However, evidence suggests that our visual systems encode the rest of the visual input more coarsely.  What is this coarse representation?  Recent evidence suggests that this coarse encoding consists of a representation in terms of summary statistics.  For a complex set of statistics, such a representation can provide a rich and detailed percept of many aspects of a visual scene.  However, such a representation is also lossy; we would expect the inherent ambiguities and confusions to have profound implications for vision.  For example, a complex pattern, viewed peripherally, might be poorly represented by its summary statistics, leading to the degraded recognition experienced under conditions of visual crowding.  Difficult visual search might occur when summary statistics cannot adequately discriminate between a target-present patch and a distractor-only patch of the stimulus.  Certain illusory percepts might arise from valid interpretations of the available – lossy – information.  It is precisely the visual tasks on which a statistical representation has significant impact that provide the evidence for such a representation in early vision.  I will summarize recent evidence, drawn from such tasks, that early vision computes summary statistics.

Efficiencies for estimating mean orientation, mean size, orientation variance and size variance

Josh Solomon, City University London; Michael J. Morgan, City University London; Charles Chubb, University of California, Irvine

The merest glance is usually sufficient for an observer to get the gist of a scene. That is because the visual system statistically summarizes its input.  We are currently exploring the precision and efficiency with which orientation and size statistics can be calculated. Previous work has established that orientation discrimination is limited by an intrinsic source of orientation-dependent noise, which is approximately Gaussian. New results indicate that size discrimination is also limited by approximately Gaussian noise, which is added to logarithmically transduced circle diameters. More preliminary results include: 1a) JAS can discriminate between two successively displayed, differently oriented Gabors, at 7 deg eccentricity, without interference from 7 iso-eccentric, randomly oriented distractors. 1b) He and another observer can discriminate between two successively displayed, differently sized circles, at 7 deg eccentricity, without much interference from 7 iso-eccentric distractors. 2a) JAS effectively uses just two of the eight uncrowded Gabors when computing their mean orientation. 2b) He and another observer use at most four of the eight uncrowded circles when computing their mean size. 3a) Mean-orientation discriminations suggest a lot more Gaussian noise than orientation-variance discriminations. This surprising result suggests that cyclic quantities like orientation may be harder to remember than non-cyclic quantities like variance. 3b) Consistent with this hypothesis is the greater similarity between noise estimates from discriminations of mean size and size variance.
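As a rough illustration of the four statistics named in the title, the sketch below computes them for a small set of stimuli. It is not the authors' analysis code: the function names are hypothetical, orientation is treated as an axial (180-degree periodic) quantity by doubling angles before averaging, and size statistics are taken on log diameter only because the abstract describes circle diameters as logarithmically transduced.

```python
import cmath
import math
import statistics


def orientation_stats(orientations_deg):
    """Return (mean orientation in degrees within [0, 180), circular variance in [0, 1])."""
    # Orientation repeats every 180 degrees, so double each angle before averaging.
    vectors = [cmath.exp(2j * math.radians(theta)) for theta in orientations_deg]
    resultant = sum(vectors) / len(vectors)
    # Halve the phase of the mean vector to return to orientation space.
    mean_deg = (math.degrees(cmath.phase(resultant)) / 2.0) % 180.0
    # Circular variance: 0 when all orientations agree, near 1 when they are spread out.
    circular_variance = 1.0 - abs(resultant)
    return mean_deg, circular_variance


def log_size_stats(diameters):
    """Return (mean, variance) of log diameter (assumes logarithmic transduction)."""
    logs = [math.log(d) for d in diameters]
    return statistics.fmean(logs), statistics.pvariance(logs)


if __name__ == "__main__":
    print(orientation_stats([170.0, 10.0, 175.0, 5.0]))
    print(log_size_stats([1.0, 1.5, 2.0, 3.0]))
```

The doubled-angle step matters because orientation is cyclic: an arithmetic mean of 170 and 10 degrees gives 90 degrees, whereas the two Gabors are both close to 0 (equivalently 180) degrees.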

The Representation of Ensemble Statistics Outside the Focus of Attention

George Alvarez, Harvard University, Department of Psychology

We can only attend to a few objects at once, and yet our perceptual experience is rich and detailed. What type of representation could enable this subjective experience? I have explored the possibility that perception consists of (1) detailed and accurate representations of currently attended objects, plus (2) a statistical summary of information outside the focus of attention. This point of view makes a distinction between individual features and statistical summary features. For example, a single object’s location is an individual feature. In contrast, the center of mass of several objects (the centroid) is a statistical summary feature, because it collapses across individual details and represents the group overall. Summary statistics are more accurate than individual features because random, independent noise in the individual features cancels out when averaged together. I will present evidence that the visual system can compute statistical summary features outside the focus of attention even when local features cannot be accurately reported. This finding holds for simple summary statistics including the centroid of a set of uniform objects, and for texture patterns that resemble natural image statistics. Thus, it appears that information outside the focus of attention can be represented at an abstract level that lacks local detail, but nevertheless carries a precise statistical summary of the scene. The term ‘ensemble features’ refers to a broad class of statistical summary features, which we propose collectively comprise the representation of information outside the focus of attention (i.e., under conditions of reduced attention).
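The claim that independent noise cancels when averaged can be checked with a small simulation. This is a minimal sketch under stated assumptions, not a model from the talk: each object's position along one axis is perceived with independent Gaussian noise of equal standard deviation, and the set size, noise level, and trial count are arbitrary illustrative values. Under those assumptions the centroid error should shrink by roughly the square root of the number of items.

```python
import math
import random
import statistics


def simulate_errors(n_items=8, noise_sd=1.0, n_trials=20000, seed=0):
    """Return (sd of single-item error, sd of centroid error) along one axis."""
    rng = random.Random(seed)
    single_errors = []
    centroid_errors = []
    for _ in range(n_trials):
        # True positions are arbitrary; only the added noise determines the error.
        true_x = [rng.uniform(-10.0, 10.0) for _ in range(n_items)]
        # Assumption: independent, equal-variance Gaussian noise on every item.
        noisy_x = [x + rng.gauss(0.0, noise_sd) for x in true_x]
        single_errors.append(noisy_x[0] - true_x[0])
        centroid_errors.append(statistics.fmean(noisy_x) - statistics.fmean(true_x))
    return statistics.pstdev(single_errors), statistics.pstdev(centroid_errors)


single_sd, centroid_sd = simulate_errors()
print("single-item error sd:", single_sd)
print("centroid error sd:", centroid_sd)
print("ratio:", single_sd / centroid_sd, "vs sqrt(n):", math.sqrt(8))
```

The benefit depends on the independence assumption; noise that is correlated across items would not average out in the same way.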

Linking statistical texture models to population coding in the ventral stream

Jeremy Freeman, Center for Neural Science, New York University; Luke E. Hallum, Center for Neural Science & Dept. of Psychology, NYU; Michael S. Landy, Center for Neural Science & Dept. of Psychology, NYU; David J. Heeger, Center for Neural Science & Dept. of Psychology, NYU; Eero P. Simoncelli, Center for Neural Science, Howard Hughes Medical Institute, & the Courant Institute of Mathematical Sciences, NYU

How does the ventral visual pathway encode natural images? Directly characterizing neuronal selectivity has proven difficult: it is hard to find stimuli that drive an individual cell in the extrastriate ventral stream, and even having done so, it is hard to find a low-dimensional parameter space governing its selectivity. An alternative approach is to examine the selectivity of neural populations for images that differ statistically (e.g. in Rust & DiCarlo, 2008). We develop a model of extrastriate populations that compute correlations among the outputs of V1-like simple and complex cells at nearby orientations, frequencies, and positions (Portilla & Simoncelli, 2001). These correlations represent the complex structure of visual textures: images synthesized to match the correlations of an original texture image appear texturally similar. We use such synthetic textures as experimental stimuli. Using fMRI and classification analysis, we show that population responses in extrastriate areas are more variable across different textures than across multiple samples of the same texture, suggesting that neural representations in ventral areas reflect the image statistics that distinguish natural textures. We also use psychophysics to explore how the representation of these image statistics varies over the visual field. In extrastriate areas, receptive field sizes grow with eccentricity. Consistent with recent work by Balas et al. (2009), we model this by computing correlational statistics averaged over regions corresponding to extrastriate receptive fields. This model synthesizes metameric images that are physically different but appear identical because they are matched for local statistics. Together, these results show how physiological and psychophysical measurements can be used to link image statistics to population representations in the ventral stream.
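The idea of correlational statistics averaged over regions that grow with eccentricity can be sketched in a heavily simplified form. The code below is not the Portilla & Simoncelli model or the authors' implementation: it works in one dimension, the two response lists merely stand in for the outputs of two V1-like filters, and the function names, the linear scaling constant of 0.5, and the minimum radius are all hypothetical choices made for illustration.

```python
import math


def pooling_radius(eccentricity_deg, scaling=0.5, min_radius=0.25):
    """Radius of a pooling region that grows linearly with eccentricity (assumed form)."""
    return max(min_radius, scaling * abs(eccentricity_deg))


def pooled_correlation(resp_a, resp_b, positions, center, radius):
    """Pearson correlation of two response maps, using samples within `radius` of `center`."""
    idx = [i for i, p in enumerate(positions) if abs(p - center) <= radius]
    if len(idx) < 2:
        return float("nan")  # too few samples in the region to define a correlation
    a = [resp_a[i] for i in idx]
    b = [resp_b[i] for i in idx]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return float("nan")  # a constant response has no defined correlation
    return cov / math.sqrt(var_a * var_b)


if __name__ == "__main__":
    positions = [0.5 * i for i in range(41)]           # sample positions, 0 to 20 deg
    resp_a = [math.sin(p) for p in positions]          # stand-in filter response
    resp_b = [math.sin(p + 0.3) for p in positions]    # stand-in for a nearby filter
    for ecc in (2.0, 8.0, 16.0):
        r = pooling_radius(ecc)
        print(ecc, r, pooled_correlation(resp_a, resp_b, positions, ecc, r))
```

Because each region keeps only the pooled statistic, two inputs that differ in the arrangement of responses within a region but share the same statistic are indistinguishable to this summary, which is the intuition behind the metameric images described above.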

High level visual ensemble statistics: Encoding the layout of visual space

Aude Oliva, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology

Visual scene understanding is central to our interactions with the world. Recognizing the current environment facilitates our ability to act strategically, for example in selecting a route for walking, anticipating where objects are likely to appear, and knowing what behaviors are appropriate in a particular context. In this talk, I will discuss a role for statistical, ensemble representations in scene and space representation. Ensemble features correspond to a higher-level description of the input that summarizes local measurements. With this ensemble representation, the distribution of local features can be inferred and used to reconstruct multiple candidate visual scenes that share similar ensemble statistics. Pooling over local measurements of visual features in natural images is one mechanism for generating a holistic representation of the spatial layout of natural scenes. A model based on such summary representation is able to estimate scene layout properties as humans do.  Potentially, the richness of content and spatial volume in a scene can be at least partially captured using the compressed yet informative description of statistical ensemble representations.

Beyond texture processing: further implications of statistical representations

Ben Balas, MIT, Department of Brain and Cognitive Sciences; Ruth Rosenholtz, MIT; Alvin Raj, MIT

The proposal that peripherally-viewed stimuli are represented by summary statistics of visual structure has implications for a wide range of tasks.  Already, my collaborators and I have demonstrated that texture processing, crowding, and visual search appear to be well-described by such representations, and we suggest that it may be fruitful to significantly extend the scope of our investigations into the affordances and limitations of a “statistical” vocabulary. Specifically, we submit that many tasks that have been heretofore described broadly as “visual cognition” tasks may also be more easily understood within this conceptual framework. How do we determine whether an object lies within a closed contour or not? How do we judge if an unobstructed path can be traversed between two points within a maze? What makes it difficult to determine the impossibility of “impossible” objects under some conditions? These specific tasks appear to be quite distinct, yet we suggest that what they share is a common dependence on the visual periphery that constrains task performance by the imposition of a summary-statistic representation of the input. Here, we shall re-cast these classic problems of visual perception within the context of a statistical representation of the stimulus and discuss how our approach offers fresh insight into the processes that support performance in these tasks and others.