Model-based simulation vs Model-free pattern matching: what computations underlie human action planning?
Poster Presentation 23.481: Saturday, May 16, 2026, 8:30 am – 12:30 pm, Pavilion
Session: Action: Miscellaneous
Aryan Zoroufi1, Nishad Gothoskar2, Leslie Kaelbling2, Joshua Tenenbaum1, Nancy Kanwisher1; 1Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, 2Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology
Vision enables us to understand our surroundings, predict future events, and plan actions. Advances in AI have produced computational models that serve as testable hypotheses for human visual cognition. In action planning, two main strategies have emerged. "Model-based" strategies estimate the world state (3D object shapes, spatial relationships, forces) from 2D inputs, forming beliefs about the environment and planning through mental simulation. While computationally expensive, model-based approaches generalize well to new scenarios. "Model-free" approaches learn direct mappings from sensory inputs to actions through pattern matching, typically as end-to-end neural networks. Model-free strategies are faster at test time but less adaptable to novel scenarios. Which framework better explains human behavior? We asked participants to search for occluded objects in virtual environments, examining which occluder they moved first. We implemented two algorithm classes: (1) model-based strategies that construct 3D scene representations from RGB images, simulate potential target locations, calculate reveal probabilities, and select the highest-probability occluder; and (2) model-free algorithms trained end-to-end to map directly from RGB images to occluder selection. Model-based algorithms predicted human behavior better than model-free approaches. Specifically, a modular perception architecture—decomposing the task into separate components (depth estimation, segmentation, 2D amodal completion, and 3D object reconstruction) wired together—fit human behavior (r^2 = 0.33, where human reliability is 0.45) better than end-to-end networks trained directly from RGB to 3D reconstruction (e.g., r^2 = 0.07 for GPT-5). These findings suggest human action planning relies on explicit 3D mental models and planning through mental simulation, rather than on learned direct 2D-to-action mappings.
The advantage of modularity in our models mirrors modular human visual cognition, where distinct perceptual processes combine to enable robust scene understanding and action planning.
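The model-based planning step described above (simulate potential target locations, calculate reveal probabilities, select the highest-probability occluder) can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: the occluder regions, the uniform prior over target locations, and all function names are assumptions for a simplified 1-D scene.

```python
import random

def reveal_probability(occluder, target_samples):
    # Fraction of hypothesized target locations this occluder would hide.
    hits = sum(1 for loc in target_samples if occluder["hides"](loc))
    return hits / len(target_samples)

def choose_occluder(occluders, prior_sampler, n_samples=10_000, seed=0):
    # Model-based selection: sample target locations from a prior over the
    # (here, toy) scene, then pick the occluder most likely to hide the target.
    rng = random.Random(seed)
    samples = [prior_sampler(rng) for _ in range(n_samples)]
    return max(occluders, key=lambda o: reveal_probability(o, samples))

# Hypothetical 1-D scene: occluder A covers [0, 0.6), occluder B covers [0.6, 1.0),
# and the target is uniformly distributed over [0, 1).
occluders = [
    {"name": "A", "hides": lambda x: 0.0 <= x < 0.6},
    {"name": "B", "hides": lambda x: 0.6 <= x < 1.0},
]
best = choose_occluder(occluders, lambda rng: rng.random())
print(best["name"])  # occluder A hides more probability mass, so it is moved first
```

A model-free counterpart would instead train a network mapping the rendered image directly to the occluder choice, with no explicit scene representation or probability computation in between.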