From Egocentric Video to Patchwork Physics: Building Empirical Priors from Everyday Experience

Poster Presentation 23.338: Saturday, May 16, 2026, 8:30 am – 12:30 pm, Banyan Breezeway
Session: Scene Perception: Intuitive physics

Abdul-Rahim Deeb1, Jason Fischer2, Leyla Isik1; 1Johns Hopkins University, 2University of Coimbra

Humans can anticipate how objects and materials will behave as they move, collide, or change state, and use these expectations to guide behavior. This capacity for intuitive physics is embedded in perception and action: recent lab-based studies show that physical regularities can act as predictive cues that are combined with incoming sensory information to support successful interaction with the environment. This raises a central question: how does perception internalize these regularities? We recently proposed a computational framework, the Patchwork Approach, that answers this question by treating intuitive physics as a multivariate perceptual manifold shaped by environmental statistics, rather than as explicit simulation over symbolic physical states. Dynamic events are located in this space by behaviorally relevant dimensions (e.g., object velocity, path geometry, apparent volume), and predictions are implemented as local computations over this structure. A remaining open question, however, is what experiences shape this manifold and how we can measure them. Here we treat large-scale egocentric video as a proxy for lived experience. Datasets such as HOI4D and Ego4D use head-mounted cameras to capture thousands of everyday actions from the observer’s viewpoint. We developed a novel pipeline combining hand–object segmentation, temporal tracking, and monocular depth estimation to extract physical variables from any egocentric video, including 3D trajectories, instantaneous velocities, path curvature, contact onset and duration, and object masks and volumes. These features define the Patchwork manifold and allow us to estimate local physical predictions. We find that video-derived slopes relating apparent mass to speed, and those linking angular launch kinematics to object trajectories, reproduce observers’ biases from our prior lab-based experiments, despite using no fitted free parameters. This methodology provides a concrete, image-computable route to estimating perceptual priors from natural interaction and opens the door to probing how such priors emerge across datasets, domains, and developmental stages in vision science.
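
As a rough illustration of the kind of computation the pipeline performs, the sketch below (Python/NumPy) back-projects tracked object centroids to 3D, derives instantaneous speed and path curvature by finite differences, and estimates a kernel-weighted local regression slope as one instance of a "local computation" over the feature manifold. All function names, the pinhole-camera intrinsics, and the Gaussian kernel are illustrative assumptions, not details reported in the abstract; the actual segmentation, tracking, and depth-estimation components are not shown.

    # Minimal sketch of the feature-extraction step described in the abstract.
    # Assumes per-frame object centroids (u, v) in pixels and metric depth z
    # have already been recovered by upstream segmentation and depth models;
    # the intrinsics (fx, fy, cx, cy) are placeholders, not pipeline values.
    import numpy as np

    def backproject(uv: np.ndarray, z: np.ndarray,
                    fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
        """Pinhole back-projection of a pixel track to 3D camera coordinates."""
        x = (uv[:, 0] - cx) * z / fx
        y = (uv[:, 1] - cy) * z / fy
        return np.stack([x, y, z], axis=1)      # (T, 3) trajectory

    def kinematics(traj: np.ndarray, dt: float):
        """Instantaneous speed and discrete path curvature of a 3D trajectory."""
        v = np.gradient(traj, dt, axis=0)       # velocity via central differences
        a = np.gradient(v, dt, axis=0)          # acceleration
        speed = np.linalg.norm(v, axis=1)
        # Curvature kappa = |v x a| / |v|^3, guarding against near-zero speed.
        kappa = np.linalg.norm(np.cross(v, a), axis=1) / np.maximum(speed, 1e-6) ** 3
        return speed, kappa

    def local_slope(x: np.ndarray, y: np.ndarray, x0: float, bandwidth: float) -> float:
        """Kernel-weighted regression slope of y on x around x0 -- one example
        of a local prediction read off the feature manifold."""
        w = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)
        xm, ym = np.average(x, weights=w), np.average(y, weights=w)
        return np.sum(w * (x - xm) * (y - ym)) / np.sum(w * (x - xm) ** 2)

Under the same assumptions, contact onset and duration would fall out of per-frame hand–object mask overlap, and object volume from mask extent combined with depth; those steps are omitted here for brevity.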

Acknowledgements: NIH R01MH132826