Probing scene understanding in long-term memory using generative models
Poster Presentation 23.330: Saturday, May 16, 2026, 8:30 am – 12:30 pm, Banyan Breezeway
Session: Scene Perception: Models, natural image statistics
Ritik Raina1 (ritik.raina@stonybrook.edu), Abe Leite1, Alexandros Graikos1, Seoyoung Ahn2, Gregory Zelinsky1; 1Stony Brook University, 2UC Berkeley
What information constitutes the long-term memory (LTM) representation of a viewed scene? We use our Seen2Scene framework to identify metamers in LTM: generated scenes that are confused with previously viewed images. Seen2Scene is a latent diffusion model that generates scenes from sparse inputs: fixation tokens (DINOv3 patches at fixated locations), a gist-like peripheral representation based on low-resolution information, and optionally text descriptions. For combined visual-text generations, custom attention processors adaptively weight contributions based on visual feature confidence; text features fill in where visual information is ambiguous. Seen2Scene was integrated into a behavioral experiment spanning two sessions. In Session 1, participants freely viewed scenes for a variable number of fixations (1, 5, or 10), and then immediately provided a detailed verbal description via a microphone. After a minimum 1-day interval, participants returned for an old/new recognition task in which test images included: (1) original scenes, (2) novel scenes not viewed in Session 1, (3) generations based on visual information alone, (4) generations based on verbal descriptions alone, and (5) generations combining both visual information and verbal descriptions. Vision-based generations produced the highest LTM metamerism rates, with the effect increasing with the number of fixations during Session 1 viewing. Verbal-only generations had low metamerism rates, performing only slightly above novel scenes. Adding verbal descriptions to vision-based generations did not increase metamerism rates. This implies that visual input during encoding, as captured by vision-based generations, primarily shapes our LTM scene representations, while verbal encoding contributes minimally. The retention interval revealed a striking crossover: metamerism from vision-based generations decreased with longer intervals, while metamerism from verbal-only generations increased. This pattern indicates that LTM representations initially reflect visual details, but may shift toward semantic content as perceptual details fade. Our findings reveal the multifaceted nature of LTM and demonstrate Seen2Scene as a powerful tool for probing scene memory representations.
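The abstract does not state how a metamerism rate is computed. One plausible reading, assumed here, is the proportion of "old" responses given to test images of a condition in the old/new recognition task. The sketch below tallies that proportion per condition and fixation count. The trial fields (`condition`, `fixations`, `response`), the function name, and the toy trials are hypothetical illustrations, not the authors' analysis code or data.

```python
from collections import defaultdict


def metamerism_rates(trials):
    """Proportion of 'old' responses per (condition, fixations) cell.

    Assumption: a metamerism rate is the fraction of test images in a
    cell that the participant judged as previously seen ("old").
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [n_old, n_total]
    for trial in trials:
        key = (trial["condition"], trial["fixations"])
        counts[key][0] += int(trial["response"] == "old")
        counts[key][1] += 1
    return {key: n_old / n_total for key, (n_old, n_total) in counts.items()}


# Made-up trials used only to exercise the function (not data from the study).
toy_trials = [
    {"condition": "vision", "fixations": 5, "response": "old"},
    {"condition": "vision", "fixations": 5, "response": "new"},
    {"condition": "verbal", "fixations": 5, "response": "new"},
    {"condition": "verbal", "fixations": 5, "response": "new"},
]
rates = metamerism_rates(toy_trials)
```

Under this reading, the same tally applied to novel scenes would give a false-alarm baseline against which the generated-image conditions can be compared.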
Acknowledgements: RR and GJZ are supported by NSF-CompCog #2444540 to GJZ. AG is supported by NSF-IIS-2212046. AL is supported by NSF-GRFP #2234683 and NIH-NEI R01EY030669 to GJZ.