Scene Perception: Behaviour, psychophysics

Talk Session: Sunday, May 19, 2024, 8:15 – 9:45 am, Talk Room 2

Talk 1, 8:15 am

Beyond Words: Rapid Scene Detection is Facilitated by High Semantic Complexity

Emily Lo1, Kaiki Chiu1, Quinn O'Connor1, Michelle R. Greene1; 1Barnard College

The adage “a picture is worth a thousand words” underscores the notion that visual information conveys rich meaning. However, not all scenes contain equal semantic depth. This study quantified the semantic complexity of images and assessed its implications for early visual processing. We asked 100 online observers to write image descriptions for a previously used set of 1000 images (Bainbridge & Baker, 2020; Greene & Trivedi, 2023). A composite semantic complexity score was computed from the median word count, the variability among descriptions (entropy in a bag-of-words model), and the average pairwise distance between concepts within a description from a word vector model (Word2Vec). We selected the 100 images with the highest semantic complexity scores and the 100 with the lowest for a rapid detection experiment. The two image groups did not differ significantly on several measures of visual complexity. We predicted that images with lower semantic complexity convey less information, and thus that observers would detect such images more quickly and accurately. Observers (N=38) distinguished between scene images and 1/f noise (SOA: ~60 ms) with a dynamic pattern mask. Contrary to our expectations, observers had higher detection sensitivity for images with greater semantic complexity (d’: 3.91 vs 3.58, p<0.005). This finding challenges the common expectation of capacity limitations in the face of stimulus complexity. Instead, it suggests that semantic richness may enhance rapid perception. One interpretation is that a more extensive set of contextual associations increases both semantic complexity and visual detectability. Alternatively, richer semantic content may engage top-down processing more effectively, aiding rapid visual detection. These results challenge typical views of cognitive load and point to highly semantic aspects of scene gist that drive early visual detection.
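The composite score combines three description-level measures. A minimal sketch of how such components could be computed, assuming simple whitespace tokenization and pre-loaded word vectors (the actual tokenizer, Word2Vec model, and the aggregation into a single composite are not specified in the abstract, so the helpers below are illustrative):

```python
import math
from collections import Counter
from itertools import combinations
from statistics import median, mean

def bow_entropy(descriptions):
    """Shannon entropy (bits) of the pooled bag of words across descriptions."""
    words = [w for d in descriptions for w in d.lower().split()]
    total = len(words)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(words).values())

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def mean_pairwise_distance(words, vectors):
    """Mean cosine distance between all pairs of in-vocabulary concepts."""
    vecs = [vectors[w] for w in words if w in vectors]
    if len(vecs) < 2:
        return 0.0
    return mean(cosine_distance(a, b) for a, b in combinations(vecs, 2))

def semantic_complexity(descriptions, vectors):
    """Raw components of the composite score: median description length,
    bag-of-words entropy, and mean pairwise concept distance."""
    med_len = median(len(d.split()) for d in descriptions)
    ent = bow_entropy(descriptions)
    dist = mean(mean_pairwise_distance(d.lower().split(), vectors)
                for d in descriptions)
    return med_len, ent, dist
```

In practice the three components would be standardized across the full image set before being summed into a single score.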

Acknowledgements: NSF CAREER 2240815 to MRG

Talk 2, 8:30 am

The psychophysics of compositionality: Relational scene perception occurs in a canonical order

Alon Hafri1, Zekun Sun2, Chaz Firestone3; 1University of Delaware, 2Yale University, 3Johns Hopkins University

An intriguing proposal in recent literature is that vision is compositional: Just as individual words combine into larger linguistic structures (as when “vase,” “table,” and “on” compose into the phrase “the vase on the table”), many visual representations contain discrete constituents that combine in systematic ways (as when we perceive a vase on a table in terms of the vase, the table, and the relation physical-support). This raises a question: What principles guide the compositional process? In particular, how are such representations composed in time? Here we explore the psychophysics of scene composition, using spatial relations as a case study. Inspired by insights from psycholinguistics, we test the intriguing hypothesis that the mind builds relational representations in a canonical order, such that ‘reference’ objects (those that are large, stable, and/or exert physical ‘control’; e.g., tables)—rather than ‘figure’ objects (e.g., vases resting atop them)—take precedence in forming relational representations. In Experiment 1, participants performed a ‘manual construction’ task, positioning items to compose scenes from sentences (e.g., “the vase is on the table”). As hypothesized, participants placed reference-objects first (e.g., table, then vase). Next, we explored whether this pattern arises in visual processing itself. In Experiment 2, participants were faster to recognize a target scene specified by a sentence when the reference-object (table) appeared before the figure-object (vase) than vice versa. Notably, this pattern arose regardless of word order (reference- or figure-first) and generalized to different objects and relations. Follow-ups showed that this effect emerges rapidly (within 100 ms; Experiment 3), persists in a purely visual task (Experiment 4), and cannot be explained by size or shape differences between objects (Experiment 5). Our findings reveal psychophysical principles underlying visual compositionality: the mind builds relational representations in a canonical order, respecting each element’s role in the relation.

Acknowledgements: NSF BCS #2021053 awarded to C.F.

Talk 3, 8:45 am

The role of object co-occurrence in attentional guidance: evidence from eye-movements

Alexandra Theodorou1, John Henderson1; 1University of California, Davis

The visual world is complex, yet visual information processing feels effortless. During scene viewing, semantically related objects are prioritized for attention (Hayes & Henderson, 2021). Previous work has defined the semantic relations relevant for gaze guidance using models from computational linguistics. Here we extend those findings by investigating relationships between objects derived from their visual scene contexts. Neuroimaging and behavioral data have shown that objects that tend to co-occur in scenes are represented nearby in the aPPA, and that frequently co-occurring objects receive higher similarity judgements in behavioral tasks (Bonner & Epstein, 2020; Magri, Elmoznino & Bonner, 2023). Here, we use measures of object-object relations derived from their visual co-occurrence statistics in scenes to predict eye-movement behavior. Eye-movement data were collected from 100 participants, each of whom viewed 100 scenes in a free-viewing task. Using object label embeddings from the object2vec model (Bonner & Epstein, 2020), we constructed map-level representations that encode similarity between objects based on their likelihood of appearing within the same scene. We used generalized mixed-effects models to estimate gaze behavior as a function of co-occurrence values. Our results suggest that objects that are more strongly related to the other objects in a scene, as indexed by co-occurrence likelihood, are more likely to be fixated. These findings underscore the role of statistical regularities, particularly co-occurrence statistics within visual contexts, in shaping efficient eye-movement behavior. Our study thus suggests that object co-occurrence forms an integral part of the semantic representations that guide eye movements, contributing to our understanding of the representational dimensions of objects in scene exploration.
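The abstract does not spell out how a map-level co-occurrence value is assigned to each object; one plausible minimal sketch, assuming object2vec-style embeddings for the labeled objects in a scene, scores each object by its mean cosine similarity to the other objects present:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def scene_relatedness(embeddings):
    """For each object in a scene, the mean similarity of its embedding to
    every other object's embedding: a scalar indexing how strongly the
    object is related to its scene context by co-occurrence statistics."""
    scores = []
    for i, e in enumerate(embeddings):
        others = [cosine(e, o) for j, o in enumerate(embeddings) if j != i]
        scores.append(sum(others) / len(others))
    return scores
```

Per-object scores like these could then enter a mixed-effects model as a fixed-effect predictor of fixation likelihood, with participant and scene as random effects.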

Talk 4, 9:00 am

Efficient coding of ensemble stimuli relative to a dynamic reference

Long Ni1, Alan A. Stocker1; 1The University of Pennsylvania

When discriminating the average of a stimulus ensemble against a reference, observers often overweigh those stimuli in the ensemble that have feature values similar to the reference—a behavior known as ‘robust averaging’. We previously proposed that this behavior can be explained by a Bayesian decision model constrained by efficient coding. Assuming our visual system rapidly forms efficient representations of ensemble stimuli relative to a dynamic reference, our model captured multiple existing datasets showing robust averaging of low-level stimulus ensembles. Here, we provide further evidence for two key predictions of the model: robust averaging should 1) become progressively more pronounced the longer the visual system is exposed to the statistics of the ensemble stimuli and 2) be reduced when the distribution of the ensemble stimuli is uniform. To test the first prediction, we had subjects discriminate the average orientation of 12 gratings displayed on a virtual circle against a central reference grating during three sessions. In every trial, ensemble orientations were drawn from a Gaussian distribution with various means relative to the (variable) reference orientation, overall creating an approximately Gaussian distribution of ensemble orientations around the reference. Across the three sessions, subjects’ discrimination accuracy continuously improved and the weighting kernel became increasingly non-uniform, attributed by our model to a reduction in internal noise and a progressively better adaptation to the ensemble statistics. We tested the second prediction by sampling orientations from two oppositely ‘skewed’ linear distributions, resulting in an overall uniform distribution centered at the reference. Subjects completed three sessions each under both Gaussian and uniform conditions. While accuracy was similar in both, robust averaging was largely absent in the uniform condition. The alignment between our model’s predictions and empirical data validates our hypothesis that the visual system can dynamically create efficient sensory representations of ensemble stimuli relative to a trial-by-trial varying reference.
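The robust-averaging behavior the model predicts can be illustrated with a toy decision rule; the Gaussian kernel and its width below are hypothetical stand-ins for the weights the efficient-coding model actually derives:

```python
import math

def weighted_decision(orientations, reference, kernel_width):
    """Decide whether the ensemble average is clockwise (+1) or
    counter-clockwise (-1) of the reference, weighting each item by a
    Gaussian kernel of its deviation from the reference. Items near the
    reference count more, which is the signature of robust averaging."""
    devs = [o - reference for o in orientations]
    weights = [math.exp(-0.5 * (d / kernel_width) ** 2) for d in devs]
    weighted_mean = sum(w * d for w, d in zip(weights, devs)) / sum(weights)
    return 1 if weighted_mean > 0 else -1
```

With a very large kernel_width the weights become effectively uniform and the rule reduces to the plain mean, mirroring the predicted absence of robust averaging under a uniform ensemble distribution.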

Talk 5, 9:15 am

Mapping a scene from afar: Allocentric representation of locations in scene-space

Anna Shafer-Skelton1, Russell Epstein1; 1University of Pennsylvania

Spatial neuroscience has discovered a great deal about how animals—primarily rodents—encode allocentric (world-centered) cognitive maps. We hypothesized that humans might be able to form such maps from afar, through visual processing alone. Previous work in vision science has explored how we extract the overall shape of scenes from particular points of view, but little is known about how we form allocentric representations of discrete locations within a scene—a key feature of a cognitive map. We tested for such a representation in two behavioral experiments. In Exp. 1, N=30 participants viewed images of a 3D-rendered courtyard, taken from one of 4 possible viewpoints outside and slightly above the courtyard, spaced 90 degrees apart. On each trial, participants saw two courtyard images separated by a brief (500 ms) delay. Within each image was an indicator object (a car), in one of six possible allocentric locations; participants reported whether the indicator object was facing the same or different allocentric direction in the two images. The task was designed to direct attention to the location of the indicator object within the allocentric framework of the courtyard without requiring explicit reporting of that location. We observed a significant performance benefit in across-viewpoint trials when the indicator object was in the same allocentric location in both images compared to when it was in different allocentric locations (BIS p=0.009; we also report d-prime: p=0.023, RT: p=0.062). In Exp. 2 (N=30), we replicated this same-location benefit when participants viewed a continuous stream of courtyard images and performed a 1-back task on the facing direction of the indicator object (BIS p=0.004; secondary measures d-prime: p=0.026, RT: p=0.023). These results show evidence for an allocentric representation of within-scene locations—a critical ingredient of allocentric cognitive maps—formed via visual exploration, without traversing the space.
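The abstract reports d-prime and BIS as sensitivity measures. Assuming BIS here refers to the balanced integration score (standardized accuracy minus standardized RT), both can be sketched as follows; the log-linear 0.5 correction in d-prime is a common convention, not necessarily the authors' exact choice:

```python
from statistics import NormalDist, mean, stdev

def dprime(hits, misses, false_alarms, correct_rejections):
    """Signal-detection sensitivity, with a log-linear correction so that
    perfect hit or false-alarm rates do not yield infinite z-scores."""
    z = NormalDist().inv_cdf
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    return z(hit_rate) - z(fa_rate)

def balanced_integration(accuracies, rts):
    """Balanced Integration Score per condition: z-scored accuracy minus
    z-scored RT, so that speed and accuracy are weighted equally."""
    za = [(a - mean(accuracies)) / stdev(accuracies) for a in accuracies]
    zr = [(r - mean(rts)) / stdev(rts) for r in rts]
    return [a - r for a, r in zip(za, zr)]
```

Combining accuracy and RT into a single score guards against speed-accuracy trade-offs masking a condition difference.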

Acknowledgements: This work was supported by an NIH-NEI grant awarded to RAE (R01-EY022350)

Talk 6, 9:30 am

Automatic Logical Inferences In Visual Scene Processing

Nathaniel Braswell1, Chaz Firestone2, Nicolò Cesana-Arlotti1; 1Yale University, 2Johns Hopkins University

The human capacity for logic is responsible for some of our grandest achievements; without it, formal mathematics, economic systems, and architectural marvels would be elusive. Yet logical cognition is not limited to rarefied intellectual challenges—it also arises in everyday contexts, such as inferring that a glass on a table must be yours because your friend is holding theirs. Previous work shows that a primitive logical operation—disjunctive syllogism (p OR q; NOT p; therefore, q)—is deployed by infants to infer the identities of objects (Cesana-Arlotti et al., 2018). This raises an intriguing question: Do such logical inferences arise automatically in adults, and even impact processing of visual scenes? In Experiment 1, adults viewed events in which an ambiguous object was ‘scooped’ by a cup from a two-item set (snake and ball). Upon seeing one of the objects outside the cup (snake), adults responded slower when the revealed object’s identity violated their logical prediction (snake) than when it was consistent (ball). The effect persisted over 40 trials, even though the revealed identity was random—suggesting that adults were executing this inference automatically. Put differently, they ‘couldn’t help’ but infer the hidden object’s identity, even when they knew they shouldn’t. Experiment 2 tested whether this effect resulted from one item’s appearance priming the other. We devised scenes with a third item in the cup, preventing logical inferences about the cup’s contents. A Bayes Factor analysis found strong evidence for the null hypothesis of no response time differences, confirming that logical inference drives the Experiment 1 effect. These findings open avenues in both logical cognition and scene processing. First, our results suggest that logical inferences may be spontaneously deployed to resolve visually uncertain events. Additionally, methods from vision science may serve as a previously unexplored tool for uncovering the nature of our mind's fundamental logical capacities.
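The inference being probed, a disjunctive syllogism over a two-item set, is simple enough to state as a toy function (item names follow the abstract's snake-and-ball example; the function is illustrative, not part of the study's methods):

```python
def infer_hidden(item_set, seen_outside):
    """Disjunctive syllogism: the cup holds p OR q; seeing p outside the
    cup (NOT p inside) implies the hidden item must be q."""
    remaining = set(item_set) - {seen_outside}
    (hidden,) = remaining  # exactly one candidate must remain
    return hidden
```

With a third item in the cup (as in Experiment 2), the remaining set is no longer a singleton and the inference fails, which is precisely why that condition blocks the logical prediction.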

Acknowledgements: NSF BCS #2021053 awarded to C.F.