Critically Evaluating Physical Reasoning in Computational Models

Poster Presentation 23.343: Saturday, May 16, 2026, 8:30 am – 12:30 pm, Banyan Breezeway
Session: Scene Perception: Intuitive physics

Grace Hu1, Nancy Kanwisher1, RT Pramod1; 1MIT

Humans perceive the physical structure of the world, predict what will happen next, and plan successful interactions. What computations underlie such intuitive physical reasoning, and can physical reasoning be learned from the statistics of visual input alone? Recent self-supervised video models such as V-JEPA have claimed to learn physics merely by learning to predict the future in naturalistic videos, suggesting that this is in principle possible for humans. Like humans, V-JEPA is ‘surprised’ by videos showing physically implausible events. However, previously tested violations – for example, object ‘teleportation’ – are also deviations from natural low-level visual motion statistics, providing an alternative explanation for V-JEPA’s surprise. Thus, to unconfound low-level visual motion statistics from physical implausibility, we generated transformed versions of two synthetic physical scene understanding video benchmarks, GRASP (Jassim et al., 2023) and IntPhys (Riochet et al., 2020), by applying spatial (left-right and up-down flips) and temporal (time-reversal) transforms. These transforms all preserve motion statistics but differentially affect the underlying physics: left-right flips always maintain plausibility, while up-down flips and time reversals remove plausibility in videos involving gravity and/or momentum. On the transformed GRASP dataset, V-JEPA is not surprised by left-right flipped videos but is surprised by up-down flipped and time-reversed videos, as expected if V-JEPA understands gravity and inertia. However, on the transformed IntPhys dataset, V-JEPA is surprised by left-right flipped videos rather than by up-down flipped or time-reversed videos, challenging earlier claims of the model’s physical scene understanding. In ongoing work, we are expanding our benchmark to enable fine-grained comparisons between models and humans and testing additional computational models.
Overall, our results temper previously made claims that fundamental physical scene properties can be learned just from observing natural statistics, and leave open the question of what computations underlie flexible physical reasoning in humans.
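The three transforms described above amount to simple axis flips on a video array. A minimal sketch, assuming videos stored as NumPy arrays of shape (T, H, W, C); the function name, mode labels, and axis conventions are our own illustrative choices, not the benchmark or model code:

```python
import numpy as np

def transform_video(video: np.ndarray, mode: str) -> np.ndarray:
    """Apply a motion-statistics-preserving transform to a video.

    video: array of shape (T, H, W, C) -- frames, height, width, channels.
    mode:  'lr'  -> left-right flip (physics preserved),
           'ud'  -> up-down flip (gravity inverted),
           'rev' -> time reversal (momentum/causal order reversed).
    """
    if mode == "lr":
        return video[:, :, ::-1]   # mirror the width axis
    if mode == "ud":
        return video[:, ::-1]      # mirror the height axis
    if mode == "rev":
        return video[::-1]         # reverse the frame order
    raise ValueError(f"unknown mode: {mode}")
```

Because each transform is a pure reflection of one axis, frame-to-frame motion magnitudes and spatial frequency content are unchanged, which is what decouples low-level motion statistics from physical plausibility.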

Acknowledgements: NSF STC Award Number 2124136 to NK