Perception of Relations, Intuitive Physics

Talk Session: Saturday, May 18, 2024, 5:15 – 6:45 pm, Talk Room 1

Talk 1, 5:15 pm

What Newton did not know about Newton’s cradle: Separating visual routines for cause and effect

Sven Ohl1, Martin Rolfs1; 1Humboldt-Universität zu Berlin

In Newton’s cradle, a moving object collides with a line of touching stationary objects, causing the object at the very end of the line to move. This demonstration of Newton’s third law of motion is fascinating to watch because the cause and effect of the motion are spatially separated. Here, in a modified version of Newton’s cradle, we exploit this separation in a visual adaptation paradigm to show that there are separate visual routines for detecting cause and effect in a causal interaction. We presented launching events in which a moving disc stopped next to another disc with varying degrees of overlap, and asked observers to indicate whether the first disc caused the second disc to move, or whether the first disc simply passed a stationary one. We fitted psychometric functions to each observer’s reports as a function of disc overlap and determined how these functions were affected by the prolonged presentation of a modified version of Newton’s cradle (i.e., the adaptor). Critically, we obtained psychometric functions for the perceived causality at the cause location and at the effect location in Newton’s cradle, and we observed significant negative aftereffects at both the cause and the effect locations (Experiments 1 and 2). Observers reported fewer launches at these locations only when the motion direction of the test event was the same as the adaptor’s motion direction (Experiment 1). Moreover, the adaptation was spatially specific: perception of launches at the location between the cause and the effect locations was not affected by adaptation (Experiment 2). These results provide compelling evidence that the perception of causality integrates information from both the cause and the effect location. This integration allows the detection of causal interaction even when the cause and effect are spatially separated in the visual environment.
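To make the analysis described above more concrete, here is a minimal sketch, not the authors' code, of fitting a psychometric function to "launch" reports as a function of disc overlap and reading an aftereffect off the shift in its midpoint. The logistic form, the coarse grid search, the function names, and all report counts are assumptions made purely for illustration.

```python
import math

def p_launch(overlap, midpoint, slope):
    """Logistic psychometric function: P('launch') falls as disc overlap grows."""
    return 1.0 / (1.0 + math.exp((overlap - midpoint) / slope))

def neg_log_likelihood(data, midpoint, slope):
    """Binomial negative log-likelihood; data = [(overlap, n_launch, n_trials), ...]."""
    nll = 0.0
    for overlap, k, n in data:
        # Clamp probabilities so log() never receives 0.
        p = min(max(p_launch(overlap, midpoint, slope), 1e-9), 1 - 1e-9)
        nll -= k * math.log(p) + (n - k) * math.log(1 - p)
    return nll

def fit_psychometric(data):
    """Coarse maximum-likelihood grid search over midpoint (0..1) and slope (0.02..0.30)."""
    best = None
    for i in range(101):
        midpoint = i / 100
        for j in range(2, 31):
            slope = j / 100
            nll = neg_log_likelihood(data, midpoint, slope)
            if best is None or nll < best[0]:
                best = (nll, midpoint, slope)
    return best[1], best[2]

# Hypothetical report counts (overlap as a fraction of disc diameter, 20 trials each).
baseline = [(0.0, 19, 20), (0.25, 17, 20), (0.5, 11, 20), (0.75, 4, 20), (1.0, 1, 20)]
adapted = [(0.0, 17, 20), (0.25, 11, 20), (0.5, 5, 20), (0.75, 2, 20), (1.0, 0, 20)]

mid_base, _ = fit_psychometric(baseline)
mid_adapt, _ = fit_psychometric(adapted)

# A negative value means fewer launches were reported after adaptation.
aftereffect = mid_adapt - mid_base
```

In this toy setup, a negative aftereffect shows up as a midpoint that moves toward smaller overlaps after adaptation. A real analysis would typically use a dedicated fitting routine and would also model lapse rates.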

Talk 2, 5:30 pm

Breaking down a golf swing: Spatio-temporal dynamics of visual motion underlie high-level structuring of observed actions

Zekun Sun1, Wenyan Bi1, Ilker Yildirim1,2, Samuel McDougle1,2; 1Department of Psychology, Yale University, 2Wu Tsai Institute, Yale University

To acquire or demonstrate a motor skill, we often break it down into a sequence of steps (e.g., a golf swing has “backswing” and “downswing” phases). But do we *see* single, smooth actions as containing discrete events? We compiled 20 animations depicting natural actions, spanning sports (e.g., kicking a ball), exercises (e.g., a jumping jack), and everyday tasks (e.g., picking up an object). In Experiment 1, observers determined a “boundary” to divide each action into two meaningful units. Consensus among observers implied a similar interpretation of the event structure of each action. Next, we explored whether these actions are spontaneously segmented during visual processing. We reasoned that if we visually represent actions as being divided into units by boundaries, then subtle changes occurring at these boundaries (specifically, during the transition between the units) should be less noticeable than changes at non-boundary moments. Experiments 2-3 tested observers’ detection of transient slowdowns and frame shifts at boundary, pre-boundary, and post-boundary frames. People were worse at detecting changes at boundaries than at non-boundaries. What kind of information about observed actions drives this effect? Experiments 4-5 applied novel distortions to the videos, removing high-level semantic information while preserving lower-level spatio-temporal dependencies. The boundary effect was weakened yet persisted, suggesting that spatio-temporal dynamics play a crucial role in the mental structuring of actions. To quantify these dynamics, we extracted optical flow fields from each pair of consecutive frames of each video and computed 16 motion statistics from the flow maps to capture global and local motion characteristics. We found that the boundary judgments in Experiment 1 could be predicted by the changes in the magnitude and direction of motion vectors, especially the smoothness of these variations.
Our results suggest that the visual system automatically imposes boundaries when observing natural actions via image-computable, spatio-temporal motion patterns.

Acknowledgements: NIH - R01NS132926

Talk 3, 5:45 pm

Fast and automatic processing of relations: the case of containment and support

Sofie Vettori1,2, Jean-Rémy Hochmann1,2, Liuba Papeo1,2; 1Institut des Sciences Cognitives—Marc Jeannerod, UMR5229, Centre National de la Recherche Scientifique (CNRS), 2Université Claude Bernard Lyon 1

Achieving a meaningful representation of the visual environment, one that can be useful for navigating, planning and acting, requires representing objects and the relations between them. We know that object recognition is efficient, i.e., reportedly fast and automatic; how fast and automatic is the processing of relations? We studied this, focusing on the fundamental relations containment and support, using frequency-tagging electroencephalography (FT-EEG). FT-EEG makes it possible to pinpoint automatic stimulus-locked responses. First, we tested, and demonstrated, that relations between multiple objects are accessed as fast and automatically as the objects themselves. Twenty adults viewed a sequence of images of object pairs presented at a base frequency (2.5 Hz), in which, for every four stimuli illustrating one relation (support: book on table, knife on chopping board), one oddball stimulus illustrating the other relation (containment: spoon in cup) appeared (oddball frequency: 0.625 Hz). EEG signals indicated responses at both frequencies, meaning that participants processed each image and spontaneously detected changes in the relation carried by oddball stimuli. A control condition demonstrated that the oddball response was not due to a regular repetition of the objects (spoon and cup). Since the above effect was found with oddball stimuli that involved (different instances of) the same objects (e.g., always spoon in cup), we tested whether the same effect could be found when only the relation remained identical (e.g., containment), while the objects changed for every oddball stimulus (spoon in cup, fish in bowl). Here, the oddball response remained significant, demonstrating that it reflected encoding of the relation itself, regardless of the objects involved in it. Finally, the oddball response remained unchanged when participants were explicitly instructed to attend to the relation, indicating that the encoding of relations is independent of attention.
We conclude that relations between objects are encoded rapidly, automatically upon stimulus presentation and in a manner that generalizes over a broad class of objects.
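The logic of frequency tagging can be illustrated with a small sketch, which is not the authors' analysis pipeline: a signal that responds to every image and, more weakly, to every oddball carries energy at both the base and the oddball frequency, and a Fourier projection at those frequencies recovers it. The two frequencies come from the abstract; the sampling rate, duration, response amplitudes, and function name are assumptions chosen for illustration.

```python
import math

BASE_HZ = 2.5        # stimulus presentation rate (from the abstract)
ODDBALL_HZ = 0.625   # oddball rate (from the abstract)
SAMPLE_RATE = 100    # Hz, assumed for this illustration
DURATION_S = 16      # assumed; holds a whole number of cycles of both frequencies

def amplitude_at(signal, freq_hz, sample_rate):
    """Amplitude of one frequency component via a single-bin discrete Fourier transform."""
    n = len(signal)
    re = sum(x * math.cos(2 * math.pi * freq_hz * i / sample_rate)
             for i, x in enumerate(signal))
    im = sum(x * math.sin(2 * math.pi * freq_hz * i / sample_rate)
             for i, x in enumerate(signal))
    return 2 * math.hypot(re, im) / n

# Synthetic "EEG": a response to every image plus a smaller response to each oddball.
n_samples = SAMPLE_RATE * DURATION_S
signal = [
    1.0 * math.sin(2 * math.pi * BASE_HZ * i / SAMPLE_RATE)
    + 0.4 * math.sin(2 * math.pi * ODDBALL_HZ * i / SAMPLE_RATE)
    for i in range(n_samples)
]

base_amp = amplitude_at(signal, BASE_HZ, SAMPLE_RATE)
oddball_amp = amplitude_at(signal, ODDBALL_HZ, SAMPLE_RATE)
# Adjacent frequency bin (spacing 1 / DURATION_S) serves as a noise reference.
neighbor_amp = amplitude_at(signal, ODDBALL_HZ + 1 / DURATION_S, SAMPLE_RATE)
```

In real recordings the neighboring bins contain noise rather than zero, so the response at a tagged frequency is usually assessed relative to its neighbors (for example, as a signal-to-noise ratio).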

Acknowledgements: SV was supported by a Marie Skłodowska-Curie individual postdoctoral fellowship (MSCA-IF 101108756) from the European Commission. LP was supported by a European Research Council Grant (Project THEMPO-758473).

Talk 4, 6:00 pm

Joint Commitment in Human Cooperative Hunting through an “Imagined We”

Siyi Gong1, Ning Tang2, Minglu Zhao1, Jifan Zhou2, Mowei Shen2, Tao Gao1; 1University of California, Los Angeles, 2Zhejiang University

For human cooperation, jointly selecting a goal out of multiple comparable goals and maintaining the team’s joint commitment to that goal pose a great challenge. By combining psychophysics and computational modeling, we demonstrate that visual perception can support spontaneous human joint commitment without any communication. We developed a real-time multi-player hunting task where human hunters could team up with human or machine hunters to pursue prey in a 2D environment with Newtonian physics. Joint commitment is modeled through an “Imagined We” (IW) approach, wherein each agent uses Bayesian inference to reason about the intention of “We”, an imagined supraindividual agent that controls all agents as its body parts. This model is compared against a Reward Sharing (RS) model, which posits cooperation as sharing reward through multi-agent reinforcement learning (MARL). We found that both humans and IW, but not RS, could maintain high team performance by jointly committing to a single prey and coordinating to catch it, regardless of prey quantity or speed. Human observers also rated all hunters of both human and IW teams as having high contributions to the catch, irrespective of their proximity to the prey, suggesting that their high-quality hunting resulted from sophisticated cooperation rather than individual strategies. IW hunters could cooperate not only with their own kind but also with humans, with human-IW teams mirroring the hunting performance and teaming experience of all-human teams. However, substituting human members with more RS hunters reduced both performance and teaming experience. In conclusion, this study demonstrates that humans achieve cooperation through joint commitment that enforces a single goal on the team, rather than merely motivating team members through reward sharing.
By extending the joint commitment theory to visually grounded cooperation, our research sheds light on how to build machines that can cooperate with humans in an intuitive and trustworthy manner.
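As a loose illustration of the Bayesian reasoning the abstract attributes to each agent, the sketch below updates a posterior over which prey is the shared goal of “We”, given the hunters' observed movements. It is not the authors' IW model: the likelihood (hunters tend to move toward the shared goal, weighted by a `rationality` parameter), the function names, and the positions are all assumptions made for this example.

```python
import math

def direction(src, dst):
    """Unit vector pointing from src to dst (zero vector if the points coincide)."""
    dx, dy = dst[0] - src[0], dst[1] - src[1]
    norm = math.hypot(dx, dy) or 1.0
    return dx / norm, dy / norm

def update_goal_posterior(prior, hunters, moves, prey, rationality=3.0):
    """One Bayesian update over which prey is the team's shared goal.

    Assumed likelihood: P(move | goal) is proportional to
    exp(rationality * cosine(move, direction to goal)). The normalizer over
    movement directions is the same for every goal, so it cancels here.
    """
    unnormalized = []
    for p_goal, target in zip(prior, prey):
        likelihood = 1.0
        for pos, move in zip(hunters, moves):
            ux, uy = direction(pos, target)
            mx, my = direction((0.0, 0.0), move)
            likelihood *= math.exp(rationality * (ux * mx + uy * my))
        unnormalized.append(p_goal * likelihood)
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Hypothetical scene: two prey, two hunters, both hunters heading right.
prey = [(10.0, 0.0), (-10.0, 0.0)]
hunters = [(0.0, 1.0), (0.0, -1.0)]
moves = [(1.0, 0.0), (1.0, 0.1)]

posterior = update_goal_posterior([0.5, 0.5], hunters, moves, prey)
```

Because both hunters move toward the first prey, the posterior concentrates on it. Each agent could then act on the most probable shared goal, which is one simple way a single commitment might emerge without communication.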

Talk 5, 6:15 pm

Unconscious intuitive physics: Prioritized breakthrough into visual awareness for physically unstable block towers

Kimberly W. Wong1, Aalap Shah1, Brian Scholl1; 1Yale University

A central goal of perception and cognition is to predict how events in our local environments are likely to unfold: what is about to happen? And of course some of the most reliable ways of answering this question involve considering the regularities of physics. Accordingly, a great deal of recent research throughout cognitive science has explored the nature of ‘intuitive physics’. The vast majority of this work, however, has involved higher-level reasoning, rather than seeing itself—as when people are asked to deliberate about how objects might move, in response to explicit questions (“Will it fall?”). Here, in contrast, we ask whether the apprehension of certain physical properties of scenes might also occur *unconsciously*, during simple passive viewing. Moreover, we ask whether certain physical regularities are not just processed, but also visually *prioritized*—as when a tower is about to fall. Observers viewed block towers—some stable, some unstable—defined in terms of whether they would collapse as a result of external physical forces (such as gravity) alone. We used continuous flash suppression (CFS) to render the towers initially invisible: observers viewed them monocularly through a mirror haploscope, while a dynamic Mondrian mask was presented to their other eye. We then measured how long towers took to break through this interocular suppression, as observers indicated when they became visually aware of anything other than the mask. The results were clear and striking: unstable towers broke into visual awareness faster than stable towers. And this held even while controlling for other visual properties—e.g. while contrasting pairs of stable vs. unstable towers sharing the same convex hull, and differing only in the horizontal placement of a single block. This work shows how physical instability is both detected and prioritized, not only during overt deliberation, but also in unconscious visual processing.

Talk 6, 6:30 pm

Decoding predicted future states from the brain’s ‘physics engine’

RT Pramod1,2, Elizabeth Mieczkowski3, Cyn Fang1,2, Josh Tenenbaum1,2, Nancy Kanwisher1,2; 1Department of Brain and Cognitive Sciences, MIT, 2McGovern Institute for Brain Research, MIT, 3Princeton University

Successful engagement with the physical world requires rapid online prediction, from swerving to avoid a collision to returning a ping-pong serve. Here we test the hypothesis that physical prediction is implemented in a set of parietal and frontal regions (aka the “hypothesized Physics Network” or hPN) that model the structure of the relevant scene and run forward simulations to predict future states. For physical scene understanding and prediction, contact relationships between objects such as support, containment, and attachment are critical because they constrain an object’s fate: if a container moves, so does its containee. In Experiment 1, participants (N = 14) were scanned with fMRI while viewing short videos (~3 s) depicting contact (contain, support, attach) and non-contact events. MVPA revealed scenario-invariant decoding of the presence versus absence of a contact relationship that was significant in the hPN but not in the ventral pathway. Experiment 2 tested whether the hPN also carries information about predicted future contact events, as expected if the hPN is engaged in forward simulation. Indeed, the voxel response patterns in hPN distinguishing between perceived contact and no-contact events were similar even for predicted events where contact was predictable but not shown. This prediction of future contact events, which generalized across objects and scenarios, was found even though participants were performing an unrelated one-back task, and was detected only in the hPN, not the ventral visual pathway. In both experiments, the key results were absent in the primary visual cortex, arguing against low-level visual feature confounds accounting for these findings. Thus, we find that the hPN both (a) encodes physical relationships between objects in a scene, and (b) predicts future states of the world, as expected if this network serves as the brain’s ‘Physics Engine’.

Acknowledgements: This project was funded by NSF NCS Project 6945933 (NGK)