Beyond objects and features: High-level relations in visual perception

Organizers: Chaz Firestone¹, Alon Hafri¹; ¹Johns Hopkins University
Presenters: Alon Hafri, Melissa Le-Hoa Võ, Liuba Papeo, Daniel Kaiser, Hongjing Lu

< Back to 2022 Symposia

A typical VSS program devotes sections to low-level properties such as motion, orientation, and location; higher-level properties such as faces, scenes, and materials; and core visual processes such as working memory and attention. Yet notably absent among these are relational representations: properties holding *between* elements, beyond any properties each element has on its own. For example, beyond perceiving red apples and glass bowls, we may also see apples contained inside bowls; beyond perceiving an object and its motion, we may see it collide with another object; and beyond perceiving two agents, we may also see them socially interact. The aim of this symposium is to showcase work that investigates relational representation using the methods and tools of vision science, including classic paradigms from visual cognition, modern neuroimaging techniques, and state-of-the-art computational modeling. A central theme is that fully understanding the nature of visual perception (including core processes such as object and scene representation, visual attention, and working memory) requires a consideration of how visual elements relate to one another.

First, Alon Hafri and Chaz Firestone will provide an overview of the "relational landscape". They will delineate criteria for determining whether a relational property is perceived rather than merely judged or inferred, and they will discuss several case studies exemplifying this framework. Second, Melissa Võ will discuss her work on "scene grammar", whereby the mind represents natural environments in terms of the typical composition of their objects (e.g., soap generally appears on sinks). Võ suggests that certain clusters of objects (especially "anchor objects") guide visual search, object perception, and memory. Third, Liuba Papeo will present her work on social relations (e.g., when two agents approach, argue, or fight). Papeo shows that the visual system identifies social relations through a prototypical "social template", and she explores the ways such representations generalize across visual contexts. Fourth, Daniel Kaiser will extend the discussion from objects to scene structure. Using neuroimaging evidence, he shows that natural scene processing is fundamentally relational: when configural relations between scene parts are disrupted, there are downstream consequences for scene and object processing. Finally, Hongjing Lu and Phil Kellman will discuss the computational machinery necessary to achieve relational representations. Although deep-learning models achieve remarkable success at many vision tasks, Lu and Kellman present modeling evidence arguing that abstract structure is necessary for representing visual relations in ways that go beyond mere pattern classification.

Overall, this work explores how relational structure plays a crucial role in how we see the world around us, and raises important questions for future vision science research. David Marr famously defined vision as the capacity to "know what is where by looking": to represent objects and their features, located somewhere in space. The work showcased here adds an exciting dimension to this capacity: not only what and where, but "how" visual elements are configured in their physical and social environment.


Perceiving relational structure

Alon Hafri¹, Chaz Firestone¹; ¹Johns Hopkins University

When we open our eyes, we immediately see the colors, shapes, and sizes of the objects around us — round apples, wooden tables, small kittens, and so on — all without effort or intention. Now consider relations between these objects: An apple supported by a table, or two kittens chasing one another. Are these experiences just as immediate and perceptual, or do they require effort and reflection to arise? Which properties of relations are genuinely perceived, and how can we know? Here, we outline a framework for distinguishing perception of relations from mere judgments about them, centered on behavioral "signatures" that implicate rapid, automatic visual processing as distinct from high-level judgment. We then discuss several case studies demonstrating that visual relations fall within this framework. First, we show that physical relations such as containment and support are extracted in an abstract manner, such that instances of these relations involving very different objects are confused for one another in fast target-identification tasks. Second, we show that the mind "fills in" required elements of a relation that are inferred from physical interaction (e.g., a man running into an invisible "wall"), producing visual priming in object detection tasks. Third, we show that when objects look like they can physically fit together, this impression influences numerosity estimates of those objects. We argue that visual processing itself extracts sophisticated, structured relations, and we reflect on the consequences of this view for theorizing about visual perception more broadly.

Hierarchical relations of objects in real-world scenes

Melissa Le-Hoa Võ¹; ¹Goethe University - Frankfurt

The sources that guide attention in real-world scenes are manifold and interact in complex ways. We have been arguing for a while now that attention during scene viewing is mainly controlled by generic scene knowledge regarding the meaningful composition of objects that make up a scene (a.k.a. scene grammar). In contrast to arbitrary target objects placed in random arrays of distractors, objects in naturalistic scenes are placed in a very rule-governed manner. In this talk, I will highlight some recent studies from my lab in which we have tried to shed more light on the hierarchical nature of scene grammar. In particular, we have found that scenes can be decomposed into smaller, meaningful clusters of objects, which we have started to call "phrases". At the core of these phrases are so-called "anchor objects", which are often larger, stationary objects that anchor strong relational predictions about where other objects within the phrase are expected to be. Thus, within a "phrase" the spatial relations of objects are strongly defined. By manipulating the presence of anchor objects, we were able to show that both eye movements and body locomotion are strongly guided by these anchor objects when carrying out actions in naturalistic 3D settings. Overall, the data I will present provide further evidence for the crucial role that anchor objects play in structuring the composition of scenes, thereby critically affecting visual search, object perception, and the formation of memory representations in naturalistic environments.

(In what sense) We see social relations

Liuba Papeo¹,²; ¹CNRS, ²Université Claude Bernard Lyon

The most basic social relation is realized when two social agents engage in a physical exchange, or interaction. How do representations of social interactions come about from basic processing in visual perception? Behavioral and neuroimaging phenomena show that human vision (and selective areas of the visual cortex) discriminates between scenes involving the same bodies, based on whether the individuals appear to interact or not. What information in a multiple-body scene channels the representation of social interaction? And what exactly is represented of a social relation in the visual system? I will present behavioral results, based on a switch cost paradigm, showing that the visual system exploits mere spatial information (i.e., relative position of bodies in space and posture features) to "decide" not only whether there is an interaction or not, but also who the agent and the patient are. Another set of results, based on a backward masking paradigm, shows that the visual processing of socially relevant spatial relations is agnostic to the content of the interaction, and indeed segregated from, and prior to, (inter)action identification. Thus, drawing a divide between perception and cognition, the current results suggest that, before inference, the visual representation of social relations corresponds to a configuration of parts (bodies/agents) that respect the spatial relations of a prototypical social interaction: a sort of "social template", theoretically analogous to the face or body template in the visual system. How specific or general this template is across different instances of social interaction will be the main focus of my discussion.

The role of part-whole relations in scene processing

Daniel Kaiser¹; ¹Justus-Liebig-Universität Gießen

Natural scenes are not arbitrary arrangements of unrelated pieces of information. Rather, their composition follows statistical regularities, with meaningful information appearing in predictable ways across different parts of the scene. Here, I will discuss how characteristic relations across different scene parts shape scene processing in the visual system. I will present recent research in which I used variations of a straightforward "jumbling" paradigm, whereby scenes are dissected into multiple parts that are then either re-assembled into typical configurations (preserving part-whole relations) or shuffled to appear in atypical configurations (disrupting part-whole relations). In a series of fMRI and EEG studies, we showed that the presence of typical part-whole relations has a profound impact on visual processing. These studies yielded three key insights: First, responses in scene-selective cortex are highly sensitive to spatial part-whole relations, and more so for upright than for inverted scenes. Second, the presence of typical part-whole structure facilitates the rapid emergence of scene category information in neural signals. Third, the part-whole structure of natural scenes supports the perception and neural processing of task-relevant objects embedded in the scene. Together, these results suggest a configural code for scene representation. I will discuss potential origins of this configural code and its role in efficient scene parsing during natural vision.

Two Approaches to Visual Relations: Deep Learning versus Structural Models

Hongjing Lu¹, Phil Kellman¹; ¹University of California, Los Angeles

Humans are remarkably adept at seeing in ways that go well beyond pattern classification. We represent bounded objects and their shapes from visual input, and also extract meaningful relations among object parts and among objects. It remains unclear what representations are deployed to achieve these feats of relation processing in vision. Can human perception of relations be best emulated by applying deep learning models to massive numbers of problems, or should learning instead focus on acquiring structural representations, coupled with the ability to compute similarities based on such representations? To address this question, we will present two modeling projects, one on abstract relations in shape perception, and one on visual analogy based on part-whole relations. In both projects we compare human performance to predictions derived from various deep learning models and from models based on structural representations. We argue that structural representations at an abstract level play an essential role in facilitating relation perception in vision.
