Intuitive Physics and Event Perception

Talk Session: Saturday, May 20, 2023, 5:15 – 6:45 pm, Talk Room 1
Moderator: Jason Fischer, Johns Hopkins University

Talk 1, 5:15 pm, 25.11

The role of agentive and physical forces in the neural representation of motion events

Seda Akbiyik1, Oliver Sussman1, Moritz Wurm2, Alfonso Caramazza1,2; 1Harvard University, 2Centre for Mind/Brain Sciences, University of Trento

Interpreting dynamic events that involve humans and objects is essential for our daily lives. How does the brain represent information about dynamic events? Neural correlates of dynamic event information have mostly been assessed in relation to human actions, highlighting a set of frontoparietal and posterior temporal regions, the so-called action observation network. However, human actions constitute only a small portion of the events around us. To better understand the complexity of the neural mechanisms involved in event recognition, human actions and events involving inanimate entities should be studied in an integrated manner. In this study, we investigated the neural activity patterns associated with observing animated actions of agents (e.g., an agent hits a chair) in relation to similar object events that were either initiated by agents (e.g., a visible agent makes an object hit a chair) or shaped purely by the physics of a scene (e.g., gravity makes an object fall down a hill and hit a chair). Using fMRI-based MVPA (N = 25), we tested where in the brain the neural activity patterns associated with motion events change as a function of, or are invariant to, agentive versus physical causes. Cross-decoding revealed a shared neural representation of agent actions and object events throughout the action observation network, regardless of whether the movements of the object were initiated by a visible agent or determined purely by physical forces. On the other hand, the right lateral occipitotemporal cortex showed higher sensitivity to cues related to animacy and agency, whereas the left dorsal premotor cortex was more sensitive to information about physics-laden object events. Overall, our findings shed light on the functional properties of brain regions classically associated with action recognition and highlight their broader role in encoding the kinematics of events.
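
A minimal sketch of the cross-decoding logic described above, written in Python with scikit-learn; the arrays, labels, and dimensions are illustrative placeholders, not the authors' data or analysis pipeline:

# Cross-decoding sketch: train a classifier on activity patterns evoked by
# agent actions, then test it on patterns evoked by object events. Above-chance
# transfer would indicate a representation shared across the two event types.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_events, n_voxels = 40, 200                       # hypothetical ROI size
X_actions = rng.normal(size=(n_events, n_voxels))  # agent-action patterns
X_objects = rng.normal(size=(n_events, n_voxels))  # object-event patterns
y = rng.integers(0, 2, size=n_events)              # event category labels

clf = LinearSVC().fit(X_actions, y)                # train on one event type
accuracy = clf.score(X_objects, y)                 # test on the other
print(f"cross-decoding accuracy: {accuracy:.2f}")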

Talk 2, 5:30 pm, 25.12

Decoding the physics of actions in the dorsal visual pathway

Moritz Wurm1, Yiğit Erigüç1; 1University of Trento

Recognizing goal-directed actions is a computationally challenging task, requiring the visual analysis not only of body movements but also of how these movements causally impact, and thereby induce a change in, the objects targeted by an action. We tested the hypothesis that subregions of the dorsal pathway, the superior and anterior inferior parietal lobes (SPL and aIPL), are specialized for the processing of body movements and of the effects they induce. In four fMRI sessions, 25 participants observed videos of actions (e.g., breaking a stick, squashing a plastic bottle) along with corresponding point-light displays, pantomimes, and abstract animations of agent-object interactions (e.g., a circle dividing or compressing a rectangle). By decoding actions across different stimulus formats (e.g., training a classifier to discriminate activation patterns associated with actions and testing the classifier on activation patterns associated with animations), we isolated different action components: Cross-decoding between actions and animations revealed that aIPL encodes abstract representations of effect structures independent of motion and object identity (e.g., dividing or compressing an object). By contrast, cross-decoding between actions and point-light displays revealed that SPL represents body movements irrespective of visible interactions with objects (interaction: F(1,24) = 35.1, p = 4.9E-06). Moreover, cross-decoding between pantomimes and animations revealed that right aIPL represents action effects even in response to implied object interactions, whereas left aIPL represents action effects exclusively in response to visible interactions with objects. These results demonstrate that the dorsal pathway contains distinct subregions tuned to different physical action features, such as how body parts move in space relative to each other and how body parts interact with objects to induce a change (e.g., in position, shape, or state). The high level of abstraction revealed by cross-decoding suggests a general neural code supporting mechanical reasoning about the movement kinematics of entities and about how entities interact with, and have effects on, each other.
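
As a worked illustration of the reported ROI-by-format interaction (F(1,24)): in a 2x2 within-subject design, the interaction F equals the squared paired t on each subject's difference of differences in cross-decoding accuracy. The sketch below uses simulated accuracies, not the authors' data:

# Interaction sketch: ROI (aIPL vs. SPL) x cross-decoding scheme
# (actions<->animations vs. actions<->point-light displays).
# All accuracies are simulated placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_subj = 25
aipl_anim = rng.normal(0.58, 0.05, n_subj)  # aIPL, actions<->animations
aipl_pld  = rng.normal(0.52, 0.05, n_subj)  # aIPL, actions<->point-light displays
spl_anim  = rng.normal(0.52, 0.05, n_subj)  # SPL, actions<->animations
spl_pld   = rng.normal(0.58, 0.05, n_subj)  # SPL, actions<->point-light displays

# Interaction F(1, n-1) equals the squared paired t on the difference of differences.
diff = (aipl_anim - aipl_pld) - (spl_anim - spl_pld)
t, p = stats.ttest_1samp(diff, 0.0)
print(f"interaction: F(1,{n_subj - 1}) = {t**2:.1f}, p = {p:.3g}")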

Talk 3, 5:45 pm, 25.13

What does learning look like? Inferring epistemic intent from observed actions

Sholei Croom1, Hanbei Zhou1, Chaz Firestone1; 1Johns Hopkins University

Beyond recognizing objects, faces, and scenes, we can also recognize the actions of other people. Accordingly, a large literature explores how we make inferences about behaviors such as walking, reaching, pushing, lifting, and chasing. However, in addition to actions with physical goals (i.e., trying to *do* something), we also perform actions with epistemic goals (i.e., trying to *learn* something). For example, someone might press on a door to figure out whether it is locked, or shake a box to determine its contents (e.g., a child wondering if a wrapped-up present contains Lego blocks or a teddy bear). Such ‘epistemic actions’ raise an intriguing question: Can observers tell, just by looking, what another person is trying to learn? And if so, how fine-grained is this ability? We filmed volunteers playing two rounds of a ‘physics game’ in which they shook an opaque box to determine either (a) the number of objects hidden inside, or (b) the shape of the objects hidden inside. Then, an independent group of participants watched these videos (without audio) and were instructed to identify which videos showed someone shaking for number and which videos showed someone shaking for shape. Across multiple task variations and hundreds of observers, participants succeeded at this discrimination, accurately determining which actors were trying to learn what, purely by observing the box-shaking dynamics. This result held both for easy discriminations (e.g., 5-vs-15) and hard discriminations (e.g., 2-vs-3), and both for actors who correctly guessed the contents of the box and actors who failed to do so — isolating the role of epistemic *intent* per se. We conclude that observers can visually recognize not only what someone wants to do, but also what someone wants to know, introducing a new dimension to research on visual action understanding.
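
A minimal sketch of how the above-chance discrimination could be tested within a single task variant, using a binomial test; the trial counts below are hypothetical placeholders, not the study's data:

# Binomial test sketch: is accuracy at attributing "shaking for number" vs.
# "shaking for shape" above the 50% chance level? Counts are hypothetical.
from scipy import stats

n_trials = 200    # judgments pooled over observers in one task variant
n_correct = 124   # correct attributions
result = stats.binomtest(n_correct, n_trials, p=0.5, alternative="greater")
print(f"accuracy = {n_correct / n_trials:.2f}, p = {result.pvalue:.4f}")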

Acknowledgements: NSF BCS 2021053

Talk 4, 6:00 pm, 25.14

That’s just how I roll!: Predicting and remembering objects’ locations via the perception of frictive surface contact

Hong B. Nguyen1, Benjamin van Buren1; 1The New School

Here we show how an implicit model of the physical force of friction is embedded in the operation of visual attention and memory. Most objects that we see are in frictive contact with a ‘floor’, such that clockwise rotation causes rightward movement, and counterclockwise rotation causes leftward movement. In Experiment 1, we reasoned that, due to this regularity, seeing an isolated, rotating ‘wheel’ might orient spatial attention in the direction the wheel would normally move if touching a floor. Indeed, we found that clockwise rotation produced faster responses to subsequent targets appearing on the right vs. left (and vice versa for counterclockwise rotation). In Experiment 2, we asked whether this ‘rotation cueing’ effect might also be sensitive to visible contact with another surface. We found that the rotating wheel produced a stronger cueing effect when seen touching (vs. not touching) a visible floor, and the *opposite* cueing pattern when seen touching a ‘ceiling’. Thus, rotating objects orient spatial attention in a way that by default assumes frictive floor contact but that is also highly sensitive to visual cues to surface contact in the scene. In Experiments 3 and 4, we asked whether memory for rotating objects’ locations similarly models their frictive interactions with other surfaces. Observers tend to misremember a moving object’s last-seen position as displaced in its direction of movement (a memory bias called ‘Representational Momentum’, or RM). We found that a lone, rightward-moving wheel produced more RM when it rotated clockwise (and vice versa for a leftward-moving wheel). Moreover, this effect was also sensitive to additional visual cues to surface contact, weakening for wheels shown near but not touching another surface, and again reversing for wheels seen touching a ceiling. To predict and remember objects’ positions, we implicitly model the propulsive consequences of the force of friction.
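
A minimal sketch of how the ‘rotation cueing’ effect in Experiments 1 and 2 could be quantified from response times; the data frame and its values are hypothetical placeholders, not the authors' data:

# Cueing-effect sketch: targets congruent with the implied rolling direction
# (clockwise -> right, counterclockwise -> left) should yield faster responses.
import pandas as pd

trials = pd.DataFrame({
    "rotation":    ["cw", "cw", "ccw", "ccw"],        # wheel rotation direction
    "target_side": ["right", "left", "left", "right"],
    "rt_ms":       [412, 437, 418, 441],               # hypothetical response times
})
implied_side = trials["rotation"].map({"cw": "right", "ccw": "left"})
trials["congruent"] = trials["target_side"] == implied_side

mean_rt = trials.groupby("congruent")["rt_ms"].mean()
cueing_effect_ms = mean_rt.loc[False] - mean_rt.loc[True]  # positive => congruency benefit
print(f"rotation cueing effect: {cueing_effect_ms:.0f} ms")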

Talk 5, 6:15 pm, 25.15

Using fMRI to study the neural basis of violation-of-expectation

Shari Liu1,2, Kirsten Lydic2, Rebecca Saxe2; 1Johns Hopkins University, 2Massachusetts Institute of Technology

Why do babies look longer when objects float in midair, or when people behave inefficiently, during violation-of-expectation (VOE) studies (Carey, 2009; Spelke, 2022)? Here we test two non-mutually exclusive hypotheses. One hypothesis (H1) is that VOE is supported by domain-general processes, like visual prediction error and endogenous attention. A second hypothesis (H2) is that VOE is supported by domain-specific prediction error over psychological and physical expectations. These hypotheses predict responses in distinct neural regions. Whereas the domain-general hypothesis predicts greater responses to unexpected than expected events in visual and multiple-demand regions, generalizing across domains, the domain-specific hypothesis predicts greater responses to unexpected events in different regions depending on the domain (e.g., supramarginal gyrus for physics, superior temporal sulcus for psychology; Deen et al., 2015; Fischer et al., 2016). To test both hypotheses, we scanned 17 adults using fMRI while they watched videos of agents and objects, adapted from infant behavioral research. Exploratory univariate fROI analyses showed that primary visual cortices responded equally to unexpected and expected events, suggesting that VOE does not evoke low-level visual prediction error. Regions in the multiple-demand network (Fedorenko et al., 2013), like the inferior frontal cortex and anterior insula, responded more to unexpected events across domains, though with smaller effect sizes, providing some support for domain-general, endogenously driven attention. Lastly, the supramarginal gyrus, a region involved in physical reasoning, responded more to unexpected than expected physical events (but not psychological events), providing evidence for domain-specific prediction error. In contrast, the superior temporal sulcus, a region involved in social perception, responded more to unexpected than expected events from both domains, though with greater responses to psychological events overall. In sum, in adult brains, both domain-specific and domain-general regions encode violation-of-expectation involving agents and objects, paving the way toward future work in human infants.
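
A minimal sketch of the domain-by-expectedness contrast logic used in the fROI analyses; region names follow the abstract, but the response values are simulated placeholders, not the study's data:

# fROI contrast sketch: within each region, compare responses to unexpected
# vs. expected events separately for physical and psychological events.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_subj = 17
for roi in ["supramarginal gyrus", "superior temporal sulcus"]:
    for domain in ["physics", "psychology"]:
        unexpected = rng.normal(0.6, 0.3, n_subj)   # mean response per subject
        expected   = rng.normal(0.4, 0.3, n_subj)
        t, p = stats.ttest_rel(unexpected, expected)
        print(f"{roi} | {domain}: t({n_subj - 1}) = {t:.2f}, p = {p:.3f}")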

Acknowledgements: DARPA CW3013552, NIH F32HD103363

Talk 6, 6:30 pm, 25.16

"Things" versus "Stuff" in the Brain

Vivian C. Paulun1, RT Pramod1, Nancy Kanwisher1; 1Massachusetts Institute of Technology

In a seminal paper published two decades ago, Adelson (2001) noted that "Our world contains both things and stuff, but things tend to get the attention." This remains the case today in the field of cognitive neuroscience. The many publications using fMRI to explore the lateral occipital complex (LOC) have focused almost exclusively on the role of this region in extracting the 3D shape of Things, without asking whether this region may also respond to Stuff with no fixed shape, such as honey, sand, or water. Similarly, investigations of the "physics network" previously implicated in visual intuitive physics (Fischer et al., 2016) have to date tested only Things, even though the physics of Stuff plays a comparable role in everyday life. Here, we asked whether LOC and the physics network are engaged when observing Stuff. We created 120 photorealistic short movie clips of four different computer-simulated substances (liquid and granular Stuff, and non-rigid and rigid Things) interacting with other objects, e.g., colliding with obstacles. The four types of videos, as well as scrambled versions of each, were presented in a blocked fMRI design while subjects (N = 6) performed an orthogonal color-change detection task. Independently localized LOC and the physics network showed higher activation for all materials than for scrambled controls (p < .05), whereas the opposite pattern was found for V1 (p < .05). Most importantly, we found that the physics network responded more to rigid and non-rigid Things than to liquid and granular Stuff (p < .05), whereas LOC responded at least as strongly to Stuff as to Things. These findings suggest that the physics network may be more engaged by the physics of Things than of Stuff, whereas LOC is not restricted to extracting the fixed 3D shape of Things but is equally engaged by Stuff with dynamically changing shapes.
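
A minimal sketch of the two contrasts described above (intact vs. scrambled videos; Things vs. Stuff), run on per-subject ROI responses; all values are simulated placeholders, not the study's data:

# Contrast sketch for the blocked design, for one ROI (e.g., the physics network):
# (1) intact vs. scrambled videos, (2) Things (rigid + non-rigid) vs.
# Stuff (liquid + granular). Responses are simulated placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_subj = 6
conditions = ["rigid", "nonrigid", "liquid", "granular"]
intact    = {c: rng.normal(1.0, 0.3, n_subj) for c in conditions}
scrambled = {c: rng.normal(0.7, 0.3, n_subj) for c in conditions}

all_intact    = np.mean([intact[c] for c in conditions], axis=0)
all_scrambled = np.mean([scrambled[c] for c in conditions], axis=0)
t1, p1 = stats.ttest_rel(all_intact, all_scrambled)

things = np.mean([intact["rigid"], intact["nonrigid"]], axis=0)
stuff  = np.mean([intact["liquid"], intact["granular"]], axis=0)
t2, p2 = stats.ttest_rel(things, stuff)

print(f"intact > scrambled: t({n_subj - 1}) = {t1:.2f}, p = {p1:.3f}")
print(f"Things vs. Stuff:   t({n_subj - 1}) = {t2:.2f}, p = {p2:.3f}")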

Acknowledgements: This work was supported by the German Research Foundation (grant PA 3723/1-1 to VCP), NIH grant DP1HD091947 to NK, a US/UK ONR MURI project (Understanding Scenes and Events through Joint Parsing, Cognitive Reasoning and Lifelong Learning), NSF STC Grant CCF-1231216, and NSF Project 2124136.