Visual Working Memory: Influences, models

Talk Session: Saturday, May 16, 2026, 5:15 – 7:00 pm, Talk Room 2
Moderator: Brad Wyble, Penn State University

Talk 1, 5:15 pm, 25.21

Unforgettable Lessons from Forgettable Images: Intra-Class Memorability Matters in Computer Vision

Serena Wang1, Jie Jing1,2, Yongjian Huang1, Shuangpeng Han1, Lucia Schiatti3,4, Yen-Ling Kuo5, Qing Lin1, Mengmi Zhang1; 1Nanyang Technological University, Singapore, 2Sichuan University, China, 3Massachusetts Institute of Technology, USA, 4Istituto Italiano di Tecnologia, Italy, 5University of Virginia, USA

We introduce intra-class memorability, where certain images within the same class are more memorable than others despite shared category characteristics. To investigate what features make one object instance more memorable than others, we design and conduct human behavioral experiments in which participants view a series of images and must identify when the current image matches one presented a few steps earlier in the sequence. To quantify memorability, we propose the Intra-Class Memorability score (ICMscore), a novel metric that incorporates the temporal intervals between repeated image presentations into its calculation. Furthermore, we curate the Intra-Class Memorability Dataset (ICMD), comprising over 5,000 images across ten object classes with their ICMscores derived from 2,000 participants' responses. Subsequently, we demonstrate the usefulness of ICMD by training AI models on this dataset for various downstream tasks: memorability prediction, image recognition, continual learning, and memorability-controlled image editing. Surprisingly, high-ICMscore images impair AI performance in image recognition and continual learning tasks, while low-ICMscore images improve outcomes in these tasks. Additionally, we fine-tune a state-of-the-art image diffusion model on ICMD image pairs with and without masked semantic objects. The diffusion model can successfully manipulate image elements to enhance or reduce memorability. Our contributions open new pathways for understanding intra-class memorability by scrutinizing the fine-grained visual features behind the most and least memorable images, and they lay the groundwork for real-world applications in computer vision. We will release all code, data, and models publicly.
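The abstract does not give the ICMscore formula, so the Python sketch below only illustrates the stated idea: detections of repeats separated by longer temporal intervals count as stronger evidence of memorability. The lag weighting, the miss penalty, and the maximum lag are assumptions, not the published definition.

import numpy as np

def icm_score(lags, detected, max_lag=12):
    """Hypothetical per-image intra-class memorability score (not the published formula).

    lags:     number of intervening images between the first and repeated presentation
    detected: 1 if the participant correctly flagged the repeat, 0 otherwise
    Assumption: a correct detection at a longer lag contributes more, while a miss
    at a short lag counts more strongly against the image's memorability.
    """
    lags = np.asarray(lags, dtype=float)
    detected = np.asarray(detected, dtype=float)
    weight = np.clip(lags / max_lag, 0.0, 1.0)
    return float(np.mean(detected * weight - (1.0 - detected) * (1.0 - weight)))

# Example: an image detected at lags 9 and 4 but missed at lag 2
print(icm_score(lags=[9, 4, 2], detected=[1, 1, 0]))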

This research is supported by the National Research Foundation, Singapore under its NRFF award NRF-NRFF15-2023-0001 and Mengmi Zhang’s Startup Grant from Nanyang Technological University, Singapore

Talk 2, 5:30 pm, 25.22

How (not) to measure memorability: Underlying "memory axes" reveal that memorability studies often reflect response bias

Niels Verosky1, Brian Scholl1; 1Yale University

Memory for a visual stimulus depends not only on the observer, but also on the stimulus itself: some objects or events seem to stick in our minds much more readily than others. But just how can — and should — this type of visual *memorability* be measured? A tidal wave of recent work has almost always employed either 'corrected recognition' (hit rate minus false-alarm rate) or d' (as a measure of signal-detection sensitivity). Here we show how these measures actually capture a mixture of memorability *and response bias* — and in the worst cases may degenerate into pure measures of bias. We demonstrate this in three steps. First, we show how this problem can occur in principle, in constructed examples. Second, we show how it operates in practice, in a set of actual case studies from the published empirical literature — highlighting examples both where this problem is tempered, and where it is catastrophic. Third, we survey the extent of this problem across the literature, in an analysis of 34 previous memorability experiments. We find that the clear majority of experiments show an unexpected positive correlation between hits and false alarms (indicative of response bias), and that inadvertently measuring response bias is ubiquitous. To address the serious limitations of existing measures, we introduce a new way of measuring memorability based on estimating "memory axes". This approach makes explicit the underlying covariance structure of the data, using principal-components analysis to directly orthogonalize memorability from response bias. We suggest that future work on memorability should aim to separate response bias and memorability as distinct dimensions of stimulus-driven memory performance. Doing so may profoundly change the theoretical interpretation of visual memorability studies.
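The abstract does not spell out the estimation procedure, but the core idea, a principal-components decomposition of per-stimulus hit and false-alarm rates in which one axis captures bias and the orthogonal axis captures memorability, can be sketched in a few lines of Python. The toy rates below and the sign-based rule for labeling the axes are illustrative assumptions, not the authors' procedure.

import numpy as np

# Per-stimulus hit and false-alarm rates (rows = stimuli); toy values, not study data.
rates = np.array([[0.82, 0.31],
                  [0.74, 0.22],
                  [0.65, 0.28],
                  [0.90, 0.41],
                  [0.58, 0.12]])

centered = rates - rates.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))  # orthogonal axes in (hit, FA) space

# Assumed labeling: the axis on which hit and false-alarm loadings share a sign
# (stimuli pushing both up or both down) is treated as response bias; the
# orthogonal axis, on which they trade off, is taken as the "memory axis".
signs = eigvecs.prod(axis=0)
bias_axis = eigvecs[:, np.argmax(signs)]
memory_axis = eigvecs[:, np.argmin(signs)]

memorability = centered @ memory_axis
bias = centered @ bias_axis
print("memorability scores:", np.round(memorability, 3))
print("bias scores:        ", np.round(bias, 3))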

Talk 3, 5:45 pm, 25.23

Semantically-rich objects increase the distinctiveness of low-level visual features

Joseph M. Saito1, Yong Hoon Chung2, Viola S. Störmer2, Timothy F. Brady1; 1University of California San Diego, 2Dartmouth College

A longstanding question in the field of visual memory concerns the nature of stimulus representations that are stored in the absence of sensory input. Models inspired by the architecture of the visual system posit that individual stimuli are represented at multiple levels of complexity along the visual hierarchy, ranging from bundles of low-level features (e.g., colors, orientations) up to higher-level objects (e.g., fruits, animals). Critically, these models predict that binding similar low-level features to different object-level representations can reduce feature confusability when those objects activate additional semantic knowledge that increases the dimensionality of each memory. To test this prediction, we used a continuous estimation working memory paradigm in which observers were asked to remember pairs of similar colors that were presented as part of semantically-rich real-world objects (e.g., a backpack, a camera) or their unrecognizable, scrambled counterparts (e.g., a scrambled backpack or camera). To minimize any incidental long-term associations between specific objects and colors (e.g., red on an apple), we chose real-world objects that were color-neutral and assigned the colors to them randomly. Furthermore, to isolate the distinctiveness conferred specifically by the semantic richness of the objects, we used the same color pairs in each condition and closely matched the visual similarity between the shapes of the intact and scrambled objects using feature values extracted from the average-pooling layer of the convolutional neural network VGG16. In doing so, we found that observers committed fewer swap errors and exhibited weaker repulsion biases when the same color pairs were encoded as part of real-world objects than scrambled objects, consistent with hierarchical memory models. These findings suggest that semantically-rich object representations reduce the confusability between low-level features by enhancing feature binding. This enhanced binding, in turn, reduces the reliance on compensatory processes, like adaptive distortions, to preserve the distinction between similar features in memory.
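The abstract names the layer (VGG16's average-pooling layer) but not the pipeline, so the following is only a sketch of how such features might be extracted and compared using torchvision; the file names and the use of cosine similarity are illustrative assumptions.

import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import models, transforms

# Pretrained VGG16; we read out the output of its average-pooling layer.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def avgpool_features(path):
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feats = vgg.avgpool(vgg.features(x))   # (1, 512, 7, 7) activation map
    return feats.flatten(1)                    # (1, 25088) feature vector

# Hypothetical stimulus files: how similar is an intact object to its scrambled counterpart?
similarity = F.cosine_similarity(avgpool_features("backpack.png"),
                                 avgpool_features("backpack_scrambled.png"))
print(similarity.item())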

This work was supported by NSF BCS-2146988 awarded to TFB

Talk 4, 6:00 pm, 25.24

Adaptive Computation in Working Memory: Goal-Conditioned Sparse Variational Gaussian Process Explains Retro-Cue Benefits and Interference Dynamics

Dongyu Gong1, Mario Belledonne1, Ilker Yildirim1; 1Yale University

Despite extensive research on working memory (WM) limits, an integrative algorithmic account that explains how finite neural resources are dynamically reconfigured to support goal-directed behavior remains elusive. We propose a novel algorithmic model of WM based on Sparse Variational Gaussian Processes (SVGP), in which memory is conceptualized as a continuous density estimation problem and capacity limits arise from a finite set of "inducing points" (computational resources) used to approximate this function. In this framework, we model the maintenance of visual information as an online, goal-driven optimization process. We tested the model against human performance in a preregistered behavioral experiment (N=30) using a continuous color reproduction task with retro-cues. Participants memorized the colors of 2 or 4 items; after a delay, a spatial cue either indicated the target item (retro-cue) or no cue was provided. The SVGP model implements a novel "adaptive computation" mechanism: during the delay, top-down goals (cues) dynamically re-weight the model's variational objective function. This forces the limited inducing points to migrate toward the cued location in representational space, sharpening the fidelity of the target while degrading non-targets. The model quantitatively predicts three key empirical signatures observed in our human data: (1) a robust retro-cue benefit (faster RTs and reduced error) driven by the reallocation of inducing points; (2) a set-size effect in which precision degrades as inducing points are stretched across multiple items; and (3) complex interference patterns in which spatial proximity interacts with feature similarity. Moreover, the model reproduces the WM distortion phenomena reported by Chunharas et al. (2022), capturing the transition from attraction, which dominates when capacity is overloaded (e.g., high set sizes), to repulsion, which emerges when capacity is sufficient (e.g., low set sizes). By formalizing WM as an online, goal-driven density approximation process, this work offers an algorithmic bridge between neural dynamics and flexible cognitive control.
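The abstract specifies the ingredients (a sparse variational GP, a finite set of inducing points, and a cue-re-weighted variational objective) but not the implementation. The GPyTorch sketch below is one way such a mechanism could be written down; the kernel, the number of inducing points, the cue weights, and the direct re-weighting of the expected log-likelihood term are all assumptions rather than the authors' model.

import torch
import gpytorch

class WMResourceGP(gpytorch.models.ApproximateGP):
    """Memory as a GP over representational space, approximated by a few inducing points."""
    def __init__(self, inducing_points):
        var_dist = gpytorch.variational.CholeskyVariationalDistribution(inducing_points.size(0))
        var_strat = gpytorch.variational.VariationalStrategy(
            self, inducing_points, var_dist, learn_inducing_locations=True)
        super().__init__(var_strat)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(self.mean_module(x), self.covar_module(x))

# Four memorized items: locations in representational space and stored feature values.
x = torch.tensor([[0.1], [0.4], [0.6], [0.9]])
y = torch.tensor([0.2, -0.5, 0.7, 0.1])

model = WMResourceGP(inducing_points=torch.rand(3, 1))      # 3 inducing points = finite resources
likelihood = gpytorch.likelihoods.GaussianLikelihood()
optimizer = torch.optim.Adam(list(model.parameters()) + list(likelihood.parameters()), lr=0.05)

# Hypothetical goal conditioning: up-weight the cued item's likelihood term so that
# optimization pulls inducing points toward it (sharpening the target, degrading non-targets).
cue_weights = torch.tensor([1.0, 4.0, 1.0, 1.0])            # item 2 is retro-cued

for _ in range(200):
    optimizer.zero_grad()
    f_dist = model(x)
    data_term = (cue_weights * likelihood.expected_log_prob(y, f_dist)).mean()
    kl_term = model.variational_strategy.kl_divergence() / y.numel()
    loss = -(data_term - kl_term)                            # cue-weighted negative ELBO
    loss.backward()
    optimizer.step()

print(model.variational_strategy.inducing_points.detach().squeeze())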

Talk 5, 6:15 pm, 25.25

Modelling the compositional and generative aspects of visual working memory

Brad Wyble1, Ian Deal1; 1Penn State University

Visual working memory (VWM) is traditionally conceived as holding multiple objects in the service of other cognitive tasks. We develop neurocomputational mechanisms that allow compositional and generative processes to expand on this traditional view. Compositionality allows objects or scenes to be decomposed into constituents that can be manipulated or recombined to form new representations. Generative processing allows conceptual information to be reconstructed in a format akin to visual sensory representations that can be re-processed by perception. Together, compositional and generative mechanisms enable the traditional memory-comparison functions of VWM while also forming a basis for visual imagery and some aspects of creativity. Our model uses a modified variational autoencoder to simulate visual processing of distinct representations of location, size, color, and shape features that are extracted from a visual canvas. These representations are then selectively stored in a binding pool that groups features into units akin to pointers or tokens. Once stored, memories can be selectively pushed back into the processing hierarchy to regenerate visual forms. This model exhibits typical effects of working memory storage, including set-size effects, loss of precision with load, and feature swaps. It can also store individuated copies of repeated items and novel shapes, but exhibits better memory for familiar shapes. The model also connects visual patterns with categorical abstractions of shape and color, thereby allowing it to classify visual forms and even to generate new visual forms based on top-down instructions. Our model suggests that generative and compositional properties provide the functionality of both visual memory and visual imagery. The model can store multi-feature objects, selectively modify one feature without modifying others, compose simple visual scenes piecemeal, and then recognize patterns in those created scenes. These results provide concrete computational instantiations of basic cognitive mechanisms that allow a much wider scope of functionality.
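The binding-pool storage step lends itself to a compact illustration. The sketch below is a generic token-gated associative pool in the spirit of the description above; the pool size, gating density, and random projection are assumptions, and the variational autoencoder that would supply and consume the latent vectors is omitted.

import numpy as np

rng = np.random.default_rng(0)
latent_dim, pool_size, n_tokens = 64, 2000, 4

class BindingPool:
    """Token-gated associative pool shared by all stored items (illustrative parameters)."""
    def __init__(self):
        # Fixed random projection between the latent space and the pool, plus a sparse gate per token.
        self.W = rng.standard_normal((latent_dim, pool_size)) / np.sqrt(latent_dim)
        self.gates = rng.random((n_tokens, pool_size)) < 0.25
        self.pool = np.zeros(pool_size)

    def store(self, z, token):
        """Superimpose one item's latent vector on the shared pool under its token."""
        self.pool += (z @ self.W) * self.gates[token]

    def retrieve(self, token):
        """Reconstruct the latent bound to a token (to be pushed back through the decoder)."""
        return (self.pool * self.gates[token]) @ self.W.T / self.gates[token].sum()

bp = BindingPool()
z_items = rng.standard_normal((3, latent_dim))       # latents from the encoder, one per object
for token, z in enumerate(z_items):
    bp.store(z, token)

z_hat = bp.retrieve(0)
fidelity = z_items[0] @ z_hat / (np.linalg.norm(z_items[0]) * np.linalg.norm(z_hat))
print(f"retrieval fidelity (cosine) for item 0 with 3 items stored: {fidelity:.2f}")

Because all tokens share one pool, reconstruction fidelity falls as more items are stored, which is the kind of set-size cost the model is meant to capture.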

This work was supported by NSF grant BCS-2216127

Talk 6, 6:30 pm, 25.26

Visualizing Scene Understanding in Short- and Long-Term Memory

Gregory Zelinsky1, Ritik Raina1, Abe Leite1, Seoyoung Ahn2; 1Stony Brook University, 2University of California, Berkeley

At VSS25, we introduced a real-time method for identifying AI-generated scenes that viewers cannot discriminate from originals—scene metamers. Here we show how our method enables the testing of previously untestable hypotheses about scene representation. We evaluated the hypothesis that long-term memory (LTM) representations may simply be language-based generations, whereas short-term memory (STM) representations are primarily visual. We tested this hypothesis by generating images under these conditions and comparing metamerism rates. To obtain scene generations, we introduce Seen2Scene, a generative AI model that combines fixation tokens (DINOv3 patches) with a gist-like scene representation (from blurred peripheral pixels) to generate plausible proxies for scene understanding. Experiment 1 obtained STM metamers by showing participants (n=47) each scene (n=300) for 1-10 fixations, followed by an 8-second delay during which Seen2Scene generated another scene from the participant’s viewing behavior. We then briefly (200 msec) presented either the originally viewed scene or the generation and asked the participant to make a same/different judgment. Metameric scenes were generations that incorrectly elicited “same” responses. Experiment 2 obtained LTM metamers for the same scenes using the same Seen2Scene model. Participants first viewed a scene (1-10 fixations) and then immediately reported (verbally, via microphone) a detailed description of the just-viewed scene. After a delay of at least one day, participants returned to the lab, where they again made same/different judgments on briefly presented (200 msec) scenes. The “different” images were either genuinely different scenes or generations from the participant’s verbal description, plus baseline variants. We found high metamerism rates in Experiment 1, and much lower metamerism rates in Experiment 2 when scenes were generated using language alone. Adding vision to language, however, resulted in metamerism rates jumping to Experiment 1 levels, suggesting that LTM scene representations are not generated from language alone but also require a visual representation.

RR and GJZ are supported by NSF-CompCog #2444540 to GJZ. AL is supported by NSF-GRFP #2234683 and NIH-NEI R01EY030669 to GJZ.

Talk 7, 6:45 pm, 25.27

From binding to bias: learned stimulus–action associations shape visual memory

Cate Trentin1,2, Chris N.L. Olivers1,2, Heleen A. Slagter1,2; 1Vrije Universiteit Amsterdam, 2Institute for Brain and Behaviour Amsterdam (iBBA)

Theories propose that visual representations, both perceptual and mnemonic, are organized not merely by their visual properties, but also by their relevance for action. Consistent with this view, we recently found that two items held in visual working memory are remembered as more dissimilar when mapped to two different actions than when mapped to the same action. This action-dependent distortion arose very quickly, even when mappings varied on a trial-by-trial basis. Here, in two experiments (N=48 each), we examined whether stable stimulus-action mappings further amplify this representational bias. On every trial, participants memorized the orientations of two bars and later reported them on a touchscreen by performing a cued action (i.e., grip or slide). For half of the stimulus pairs, action mappings were fixed throughout the experiment. For the remaining pairs, action mappings varied randomly across trials. Learning of the fixed contingencies was assessed afterwards. In Experiment 1, we replicated our earlier finding: for the randomly varied mappings, orientations linked to different actions repelled each other more than those linked to the same action. Critically, this repulsion effect was enhanced for the fixed mappings, but only for participants who had learned the contingencies. To increase learning, in Experiment 2 we reduced the number of contingencies. We again observed stronger action-based repulsion for learned (fixed) mappings than for random mappings, now across the full group. These findings indicate that as stimulus-action associations become consolidated through learning, they increasingly shape visual memory representations.
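The key dependent measure, whether a reported orientation is pushed away from the other memorized orientation, can be made concrete with a small sketch; the 180-degree orientation period and the sign convention are assumptions, and the study's actual analysis may differ.

def circ_diff(a, b, period=180.0):
    """Signed angular difference a - b for orientations, wrapped into (-period/2, period/2]."""
    d = (a - b) % period
    return d - period if d > period / 2 else d

def repulsion(report, target, nontarget):
    """Positive values: the report is biased away from the other memorized orientation."""
    error = circ_diff(report, target)
    toward_other = 1.0 if circ_diff(nontarget, target) > 0 else -1.0
    return -error * toward_other

# Hypothetical trial: target 40 deg, other item 60 deg, report 35 deg -> 5 deg of repulsion
print(repulsion(report=35.0, target=40.0, nontarget=60.0))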