VSS, May 13-18

Perceptual Organization

Talk Session: Saturday, May 14, 2022, 5:15 – 7:15 pm EDT, Talk Room 1
Moderator: Benjamin van Buren, New School for Social Research, NYC

Talk 1, 5:15 pm, 25.11

Color-motion feature misbinding with optic-flow versus vertical motion

Sunny M. Lee1, Steven K. Shevell1; 1University of Chicago

Effortlessly tracking the flight of a red frisbee belies the “binding problem,” a challenge the visual system faces in integrating multiple features processed along relatively independent neural pathways. Limitations of the binding process are revealed by feature misbinding of color and motion in the periphery of the visual field: With vertical motion of two differently colored groups of dots and each group moving upward or downward, an incorrect (illusory) conjunction of features can be induced so that, say, red dots moving physically upward are perceived to move downward (Wu, Kanai, & Shimojo, 2004). Aim: We investigated whether color-motion feature misbinding observed with vertical motion extends to optic-flow motion, which differs by being an ensemble motion percept with continuous motion vectors from center to periphery. Method: Observers saw overlaid groups of moving red and green dots and reported the motion direction of the peripheral dots of one color while fixating at the center. The motion direction assigned to each set of colored dots could be reversed between the center and the periphery to induce misbinding of color and motion. The motion could be vertical, with upward or downward direction, or radial optic flow, with expanding or contracting directions. Overall, the red and green dots could vary in motion direction, motion type (vertical or optic-flow), or speed. Results/Conclusion: Radial optic-flow motion revealed color-motion feature misbinding in the periphery, as found previously for vertical motion. Mixing motion types (vertical motion for one color and radial optic-flow motion for the other) reduced misbinding by at least 50%. Mixing different speeds caused a smaller but still significant reduction in misbinding. In sum, peripheral feature misbinding of color and motion that results from grouping moving objects in the center and periphery depends on a common motion type and motion speed in the central and peripheral areas.
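
The vertical versus optic-flow manipulation comes down to how each dot's per-frame displacement vector is assigned. Below is a minimal sketch of that difference, not the authors' stimulus code; dot counts, speeds, and coordinates are placeholder assumptions.

```python
# Sketch (illustrative only): per-frame displacement vectors for the two motion types.
import numpy as np

def motion_step(xy, motion_type, direction=+1, speed=0.05):
    """Return per-dot displacement for one frame.

    xy          : (N, 2) dot positions relative to fixation at (0, 0)
    motion_type : 'vertical' (uniform up/down) or 'optic_flow' (radial expand/contract)
    direction   : +1 = upward / expanding, -1 = downward / contracting
    speed       : displacement magnitude per frame (arbitrary units)
    """
    if motion_type == 'vertical':
        # Every dot shares the same vertical velocity vector.
        return np.tile([0.0, direction * speed], (len(xy), 1))
    elif motion_type == 'optic_flow':
        # Each dot moves along its own radial direction from fixation, giving the
        # continuous center-to-periphery vector field of optic flow.
        radii = np.linalg.norm(xy, axis=1, keepdims=True)
        unit = xy / np.where(radii == 0, 1.0, radii)
        return direction * speed * unit
    raise ValueError(motion_type)

# Example frame: red dots expand while green dots contract.
red = np.random.uniform(-1, 1, (100, 2))
green = np.random.uniform(-1, 1, (100, 2))
red += motion_step(red, 'optic_flow', +1)
green += motion_step(green, 'optic_flow', -1)
```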

Acknowledgements: Supported by NIH EY-026618

Talk 2, 5:30 pm, 25.12

Hidden by letters: How grouping lines into letters interferes with ensemble perception

Sabrina Hansmann-Roth1, Bilge Sayim1,2; 1University of Lille, France, 2University of Bern, Switzerland

The visual system’s ability to extract statistical information such as mean or variance from groups of objects is generally referred to as ensemble perception. In a standard setup, observers are presented with a group of, for example, lines and are asked to indicate their average orientation. Hitherto, ensemble perception has been studied only using isolated and randomly displayed features. However, three horizontal lines and a vertical line in the right spatial arrangement can also form a more complex object such as the letter E. One might consequently ask: what are the limitations of averaging individual features when they are grouped into objects? Here, we investigate how extracting summary statistics is affected by the arrangement of the individual elements of ensembles. If elements are grouped into complex objects, can these individual elements still be averaged, or does grouping break the averaging ability? We presented observers with letters and various types of letter manipulations, such as scrambled or disassembled letters, and asked them to judge the average orientation by rotating a single line afterwards. Our results revealed strong effects of hierarchical complexity: Compared to scrambled or disassembled letters, averaging performance was impaired when lines were grouped into letters. Moreover, adjustment time and the number of line rotations revealed similar patterns: Observers required more time and rotations to complete their adjustment when lines formed letters compared to the identical, but disassembled, lines. Our results show for the first time that once features are part of objects, extracting their summary statistics is impaired, showing both larger adjustment errors and longer reaction times. These results allow for a new perspective on the extraction of statistical information and the question of the level of object processing at which the averaging of features happens. Our results indicate that summary statistics are extracted before features are combined into more complex objects.
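
The reported quantity, the average orientation of a set of lines, is defined on the axial 0-180 degree scale (a line at theta is the same line at theta + 180 degrees), so the mean is taken on doubled angles. A minimal sketch of that computation; the stimulus values are made up.

```python
# Sketch: the average-orientation quantity observers adjust a probe line to match.
# Orientation is axial, so angles are doubled before circular averaging.
import numpy as np

def mean_orientation(theta_deg):
    """Circular mean of line orientations (degrees), on the 0-180 deg axial scale."""
    doubled = np.deg2rad(np.asarray(theta_deg, float) * 2.0)
    mean_doubled = np.arctan2(np.sin(doubled).mean(), np.cos(doubled).mean())
    return (np.rad2deg(mean_doubled) / 2.0) % 180.0

# Four strokes of a letter 'E': three horizontal lines and one vertical line.
print(mean_orientation([0, 0, 0, 90]))      # ~0.0 deg: the horizontals dominate
# The same computation applies to scrambled or disassembled lines.
print(mean_orientation([10, 35, 60, 85]))   # ~47.5 deg
```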

Talk 3, 5:45 pm, 25.13

Objects that look heavier look larger

Hong B. Nguyen1, Benjamin van Buren1; 1The New School

Beyond seeing objects in terms of their lower-level features, such as color and motion, we also see them in terms of seemingly higher-level properties, such as their masses, and the physical forces acting upon them. Determining objects’ masses from sensory data is not always straightforward: for example, if we see one object accelerate more slowly than another object, this could reflect the same force acting on objects of different masses, different forces acting on objects of the same mass, etc. In Experiments 1 and 2, we asked whether the visual perception of the relative masses of two ‘launched’ objects influences our perception of those objects’ relative sizes on the screen. When two stationary objects were simultaneously struck by a common launching object, observers reliably saw the slower-accelerating of the two as larger on the screen (a visual ‘weight-size illusion’) — and this effect was attenuated when the objects were instead struck independently by two unconnected launchers (rendering ambiguous whether their different accelerations were due to a difference in mass or to a difference in imparted force). In Experiments 3 and 4, we tested whether the apparent relative mass of two objects on either side of a fulcrum influences our perception of their relative size on the screen. Observers tended to see the heavier-looking object (when the seesaw ‘tipped’, the descending object; when the seesaw remained still, the lower object) as larger — an effect which depended critically on drawing the connecting ‘seesaw’ line between the two objects to convey their participation in the same force system. All four experiments support the conclusion that we automatically see objects in terms of their masses in a way that (1) depends on sophisticated analysis of the forces at play in a scene, and (2) influences the perception of other visual properties.
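
The mass/force ambiguity in the launching displays follows directly from Newton's second law. Here is a toy numerical sketch of the two interpretations the abstract contrasts; all values are arbitrary.

```python
# Two observed accelerations after a launch (arbitrary units).
a_fast, a_slow = 4.0, 2.0

# Interpretation 1: a single launcher imparts a common force F to both objects.
# By F = m * a, the slower-accelerating object is inferred to be twice as heavy.
F = 8.0
m_fast, m_slow = F / a_fast, F / a_slow               # 2.0 vs. 4.0

# Interpretation 2: two unconnected launchers apply different forces to objects
# of equal mass, producing the same accelerations with no mass difference.
m_equal = 2.0
F_fast, F_slow = m_equal * a_fast, m_equal * a_slow   # 8.0 vs. 4.0
```

A common launcher licenses the first interpretation (hence the weight-size illusion), whereas independent launchers leave the two interpretations unresolved, which is consistent with the attenuated effect reported above.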

Talk 4, 6:00 pm, 25.14

How bar graphs deceive: readout-based measurement reveals three fallacies

Jeremy Wilmer1, Sarah Kerns2; 1Wellesley College

A substantial portion of all data insights is conveyed via mean values, and mean values are commonly depicted via bar graphs. Bar graphs of means (BGoMs) are frequently presumed to be accessible to non-experts. Yet evidence for or against this presumption remains sparse. Here, we use a readout-based measurement approach developed by our lab (Wilmer & Kerns, 2021) to document three fundamental fallacies in the interpretation of BGoMs. A readout is a relatively concrete, detailed, uninterpreted record of thought, typically produced via pencil-and-paper drawing. In the present study, each of 133 demographically diverse participants sketched stimulus BGoMs along with their best guess of the individual datapoints that were averaged to produce the shown mean values. As a test of reproducibility, each participant completed drawings for four stimulus graphs that were selected to represent diverse content areas (developmental, clinical, social, cognitive), data types (questionnaire, performance), and BGoM forms (unidirectional, bidirectional) within the broad field of psychology. The three observed fallacies were: (1) a Bar-Tip Limit Error (data plotted inside the bar, rather than spread across the bar-tip, as if the bar represented counts instead of means), (2) a Dichotomization Fallacy (complete non-overlap, or dichotomy, between distributions that should overlap), and (3) a Uniformity Fallacy (data distributed uniformly, rather than in normal (Gaussian) form). These fallacies were largely independent of each other. While they varied somewhat in prevalence between stimulus graphs, each was common and consistently displayed by individuals across graph stimuli. Together, they impacted 52% to 83% of readouts, depending on the stimulus graph. The existence of multiple common severe fallacies in the interpretation of BGoMs raises serious questions about the presumed accessibility of BGoMs. The efficiency and clarity with which these fallacies are revealed by our readout-based approach suggests that readout-based measurement holds promise for the study of graph cognition.
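
The three fallacies can be stated as simple checks on a drawn readout. The sketch below is one hypothetical way to flag them; the criteria and the flatness heuristic are illustrative assumptions, not the authors' published scoring rules.

```python
# Sketch: flagging the three readout fallacies from drawn data points (illustrative).
import numpy as np
from scipy import stats

def classify_readout(points_a, points_b, bar_tip_a, bar_tip_b):
    """points_*: y-values of the drawn points for each bar; bar_tip_*: the bars' heights."""
    a, b = np.asarray(points_a, float), np.asarray(points_b, float)

    def looks_uniform(x):
        # Does a uniform distribution fit the drawn points at least as well as a
        # normal distribution with the same mean and spread? (Crude KS comparison.)
        d_unif = stats.kstest(x, 'uniform', args=(x.min(), x.max() - x.min())).statistic
        d_norm = stats.kstest(x, 'norm', args=(x.mean(), x.std())).statistic
        return d_unif <= d_norm

    return {
        # (1) Bar-Tip Limit Error: all points drawn inside the bar, as if it showed counts.
        'bar_tip_limit_error': bool((a <= bar_tip_a).all() and (b <= bar_tip_b).all()),
        # (2) Dichotomization Fallacy: the two drawn distributions do not overlap at all.
        'dichotomization': bool(a.max() < b.min() or b.max() < a.min()),
        # (3) Uniformity Fallacy: points spread evenly rather than bunched near the mean.
        'uniformity': bool(looks_uniform(a) or looks_uniform(b)),
    }
```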

Acknowledgements: This research was funded in part by NSF award #1624891 to JBW, a Brachman Hoffman grant to JBW, and a sub-award from NSF grant #1837731 to JBW.

Talk 5, 6:15 pm, 25.15

Spatial affordances can automatically trigger dynamic visual routines: Spontaneous path tracing in task-irrelevant mazes

Kimberly W. Wong1, Brian Scholl1; 1Yale University

Visual processing usually seems both incidental and instantaneous. But imagine viewing a jumble of shoelaces, and wondering whether two particular tips are part of the same lace. You can answer this by looking, but doing so may require something dynamic happening in vision (as the lace is effectively ‘traced’). Such tasks are thought to involve ‘visual routines’: dynamic visual procedures that efficiently compute various properties on demand, such as whether two points lie on the same curve. Past work has suggested that visual routines are invoked by observers’ particular (conscious, voluntary) goals, but here we explore the possibility that some visual routines may also be automatically triggered by certain stimuli themselves. In short, we suggest that certain stimuli effectively *afford* the operation of particular visual routines (as in Gibsonian affordances). We explored this using stimuli that are familiar in everyday experience, yet relatively novel in human vision science: mazes. You might often solve mazes by drawing paths with a pencil — but even without a pencil, you might find yourself tracing along various paths *mentally*. Observers had to compare the visual properties of two probes that were presented along the paths of a maze. Critically, the maze itself was entirely task-irrelevant, but we predicted that simply *seeing* the visual structure of a maze in the first place would afford automatic mental path tracing. Observers were indeed slower to compare probes that were further from each other along the paths, even when controlling for lower-level visual properties (such as the probes’ brute linear separation, i.e. ignoring the maze ‘walls’). This novel combination of two prominent themes from our field — affordances and visual routines — suggests that at least some visual routines may operate in an automatic (fast, incidental, and stimulus-driven) fashion, as a part of basic visual processing itself.
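
The key predictor is the distance between the two probes along the maze paths, as opposed to their straight-line separation. Below is a minimal sketch with a toy grid maze; the layout and the 0/1 encoding are illustrative assumptions.

```python
# Sketch: path distance between two probes (respecting maze walls) vs. straight-line distance.
from collections import deque
import math

def path_distance(maze, start, goal):
    """Breadth-first search over open cells (0 = path, 1 = wall); returns step count."""
    rows, cols = len(maze), len(maze[0])
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        (r, c), d = queue.popleft()
        if (r, c) == goal:
            return d
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and maze[nr][nc] == 0 and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), d + 1))
    return None  # probes not connected by any path

maze = [[0, 1, 0],
        [0, 1, 0],
        [0, 0, 0]]
a, b = (0, 0), (0, 2)
euclidean = math.dist(a, b)             # 2.0: the probes are close on the screen
along_path = path_distance(maze, a, b)  # 6: but far apart along the maze paths
```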

Acknowledgements: This project was funded by ONR MURI #N00014-16-1-2007 awarded to BJS.

Talk 6, 6:30 pm, 25.16

Toward modeling visual routines of object segmentation with biologically inspired recurrent vision models

Lore Goetschalckx1, Maryam Zolfaghar1,2, Alekh K. Ashok1, Lakshmi N. Govindarajan1, Drew Linsley1, Thomas Serre1; 1Brown University, 2University of California Davis

A core task of the primate visual system is to organize its retinal input into coherent figural objects. While psychological theories dating back to Ullman (1984) suggest that such object segmentation at least partially relies on feedback, little is known about how these computations are implemented in neural circuits. Here we investigate this question using the neural circuit model of Serre et al. (VSS 2020), which is trained to solve visual tasks by implementing recurrent contextual interactions through horizontal feedback connections. When optimized for contour detection in natural images, the model rivals human performance and exhibits sensitivity to contextual illusions typically associated with primate vision, despite having no explicit constraints to do so. Our goal here is to understand whether the visual routine this feedback model discovers for object segmentation can explain the one used by human observers, as measured in a behavioral experiment where participants judged whether a cue dot fell on the same or a different object silhouette as a fixation dot (Jeurissen et al. 2016). To train the model, we built a large natural image dataset of object outlines (N~250K), where each sample included a “fixation” dot on one object. The model learned to segment the target object by adopting an incremental grouping strategy resembling the growth-cone family of psychology models for figure-ground segmentation, through which it achieved near-perfect segmentation accuracy on a validation dataset (F1=.98) and the novel stimulus set used by Jeurissen et al. (N=22, F1=.98). Critically, the model exhibited a similar pattern of reaction times as humans, indicating that its circuit constraints reflect possible neural substrates for the visual routines of object segmentation in humans. Overall, our work establishes task-optimized models of neural circuits as an interface for generating experimental predictions that link cognitive science theory with exact neural computations.
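
The incremental grouping strategy the model converges on can be caricatured as a label spreading outward from the fixation dot while staying confined to the cued silhouette. The sketch below is a toy flood fill illustrating that idea, not the trained recurrent circuit model; the iteration count stands in for reaction time.

```python
# Sketch: "growth-cone"-style incremental grouping as an iterative, mask-constrained dilation.
import numpy as np
from scipy.ndimage import binary_dilation

def incremental_grouping(silhouettes, fixation, max_iters=500):
    """silhouettes: boolean mask of object pixels; fixation: (row, col) seed pixel.

    Returns the grown segmentation mask and the number of spreading iterations,
    a crude proxy for reaction time (more spread needed = slower response)."""
    silhouettes = np.asarray(silhouettes, dtype=bool)
    grown = np.zeros_like(silhouettes)
    grown[fixation] = True
    for iteration in range(1, max_iters + 1):
        # Spread the group label by one pixel, but only into pixels of the cued object.
        new = binary_dilation(grown) & silhouettes
        if (new == grown).all():
            return grown, iteration
        grown = new
    return grown, max_iters

# A cue dot on the same silhouette is reached after a few iterations (fast "same"
# response); a cue dot on a different silhouette is never reached ("different").
```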

Acknowledgements: ONR (N00014-19-1-2029) and NSF (IIS-1912280)

Talk 7, 6:45 pm, 25.17

Human-like signatures of contour integration in deep neural networks

Fenil Doshi1, Talia Konkle1, George Alvarez1; 1Harvard University

Deep neural networks have become the de facto models of human visual processing, but currently lack human-like representations of global shape information. For humans, it has been proposed that global shape representation starts with early mechanisms of contour integration. For example, people are able to integrate over local features and detect extended contours embedded in noisy displays, with high sensitivity for straight lines and systematically decreasing sensitivity as contours become increasingly curvilinear (Field et al., 1993). Here, we tested whether deep neural networks have contour detection mechanisms with these human-like perceptual signatures. Considering a deep convolutional neural network trained to do object recognition (Alexnet), we find that the pre-trained layer-wise feature spaces have little to no capacity to detect extended contours. However, when the network was fine-tuned to detect the presence or absence of a hidden contour, the fine-tuned feature spaces were able to perform contour detection nearly perfectly. Further, using a gradient-based visualization method – guided backpropagation – we find that these fine-tuned classifiers are indeed identifying the full contour, rather than leveraging some unexpected strategy to succeed at the task. Critically, we also found that the scope of fine-tuning was key to achieving human-like contour detection: networks trained only to detect relatively straight contours naturally showed human-like graded accuracy in detecting increasingly curvilinear contours, while networks fine-tuned across the full range of curvature values, or at intermediate curvature levels only, showed distinctly non-human-like signatures, with peaks at the trained curvatures. These results provide a computational argument that human contour detection may actually rely on mechanisms solely designed to amplify relatively linear contours. Further, these results demonstrate that convolutional neural network architectures are capable of proper contour detection, but do not have the relevant inductive biases to develop these contour-integration mechanisms in service of object classification tasks.
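
The fine-tuning manipulation is straightforward to sketch: a pre-trained torchvision AlexNet receives a new binary contour-present/absent head. Freezing the convolutional features, the learning rate, and the head replacement below are assumptions for illustration, not the authors' exact procedure.

```python
# Sketch: repurposing a pre-trained AlexNet for a two-way contour detection decision.
import torch
import torch.nn as nn
from torchvision import models

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)

# Keep the pre-trained convolutional feature spaces fixed; read the decision out of them.
for p in model.features.parameters():
    p.requires_grad = False

# Replace the 1000-way object-classification head with a binary contour detector.
model.classifier[6] = nn.Linear(model.classifier[6].in_features, 2)

optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """images: batch of contour-in-noise displays; labels: 1 = contour present, 0 = absent."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```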

Acknowledgements: NSF PAC COMP-COG 1946308, NSF CAREER BCS-1942438

Talk 8, 7:00 pm, 25.18

Representing multiple visual objects in the human brain and convolutional neural networks

Viola Mocz1, Su Keun Jeong2, Marvin Chun1, Yaoda Xu1; 1Yale University, 2Chungbuk National University

In both monkey neurophysiology and human fMRI studies, neural responses to a pair of unrelated objects can be well approximated by the average responses of each constituent object shown in isolation. This shows that at higher levels of visual processing, the whole is equal to the average of its parts. Recent convolutional neural networks (CNNs) have achieved human-like object categorization performance, leading some to propose that CNNs are the current best models of the primate visual system. Does the same averaging principle hold in CNNs trained for object classification? Here we re-examined a previous fMRI dataset where human participants viewed object pairs and their constituent objects shown in isolation. We also examined the activations in five CNNs pre-trained for object categorization to the same images shown to the human participants. The CNNs examined varied in architecture and included shallower networks (Alexnet and VGG-19), deeper networks (Googlenet and Resnet-50), and a recurrent network with a shallower structure designed to capture the recurrent processing in macaque IT (Cornet-S). While responses to object pairs could be fully predicted by responses to single objects in the human lateral occipital cortex, this was found in neither lower nor higher layers in any of the CNNs tested. The whole is thus not equal to the average of its parts in CNNs. This indicates the existence of interactions between the individual objects in a pair that are not present in the human brain, potentially rendering these objects less accessible in CNNs at higher levels of visual processing than they are in the human brain. The present results unveil an important representational difference between the human brain and CNNs.
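
The averaging test is easy to state computationally: compare a layer's response to the object pair with the average of its responses to the two objects shown alone. A minimal sketch assuming a torchvision AlexNet and a penultimate-layer readout; the authors examined several CNNs and layers, and image loading and preprocessing are omitted here.

```python
# Sketch: does a CNN layer's response to an object pair match the average of the
# responses to the constituent objects shown in isolation?
import numpy as np
import torch
from torchvision import models
from torchvision.models.feature_extraction import create_feature_extractor

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
extractor = create_feature_extractor(model, return_nodes={'classifier.5': 'penultimate'})

def layer_response(image_tensor):
    """image_tensor: preprocessed (1, 3, 224, 224) input; returns a 1D activation vector."""
    with torch.no_grad():
        return extractor(image_tensor)['penultimate'].flatten().numpy()

def averaging_fit(img_a, img_b, img_pair):
    """Correlate the pair response with the average of the two single-object responses."""
    avg = (layer_response(img_a) + layer_response(img_b)) / 2.0
    pair = layer_response(img_pair)
    return np.corrcoef(avg, pair)[0, 1]
```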

Acknowledgements: NIH grants 1R01EY022355 and 1R01EY030854 to Y.X.