Diagnosing Bottlenecks in Data Visualization Understanding by Vision-Language Models
Poster Presentation 36.329: Sunday, May 17, 2026, 2:45 – 6:45 pm, Banyan Breezeway
Session: Spatial Vision: Natural images, texture
Alexa Tartaglini1 (alexart@stanford.edu), Daniel Wurgaft1, Satchel Grant1, Christopher Potts1, Judith Fan1; 1Stanford University
Understanding any data visualization requires integrating multiple types of visual input (e.g., text, numerals, shapes, curves) to draw appropriate inferences. Current vision-language models (VLMs) are, in principle, promising candidates for modeling the fundamental computations that underlie data visualization understanding. However, these models still struggle on basic data visualization understanding tasks (Verma et al., 2025), and the causes of failure remain unclear. Are VLM failures attributable to limitations in how visual information in the data visualization is encoded, how information is transferred between the vision and language modules, or how information is processed within the language module? We developed FUGU, a suite of data visualization understanding tasks (e.g., extracting the positions of data points, the distances between them, and other summary statistics), to precisely characterize potential sources of difficulty in these models. We used FUGU to investigate three widely used VLMs (LLaMA-3.2, LLaVA-OneVision, InternVL3). To diagnose the sources of errors produced by these models, we used activation patching and linear decoding to trace information flow through each model component. We found that some models fail to generate the coordinates of individual data points correctly, and these initial errors often lead to erroneous final responses. When these models are provided with the correct coordinates, performance improves substantially, suggesting that the downstream mathematical reasoning steps performed in the language module are sound. Moreover, even when the model generates an incorrect response, the correct coordinates can be reliably decoded from latent representations in the visual encoder, suggesting that the source of these errors lies not in limitations within the visual encoder, but in the vision-language handoff. Together, these findings point to some of the opportunities and challenges presented by VLMs as a basis for developing input-computable models of data visualization understanding.
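To illustrate the linear decoding analysis mentioned above, the following is a minimal sketch (not the authors' code) of fitting a linear probe from a vision encoder's latent representations to the ground-truth coordinates of plotted data points. All variable names, shapes, and the use of synthetic stand-in activations are illustrative assumptions; in practice the activations would come from the VLM's visual encoder on FUGU stimuli.

```python
# Hypothetical linear-probe sketch: can (x, y) coordinates of a data point be
# read out linearly from the vision encoder's representation of a chart?
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in data: one pooled encoder embedding of dimension d per chart,
# plus the normalized (x, y) coordinates of a target data point in that chart.
n_charts, d = 500, 1024
encoder_states = rng.normal(size=(n_charts, d))          # placeholder for real activations
true_coords = rng.uniform(0.0, 1.0, size=(n_charts, 2))  # placeholder ground truth

X_train, X_test, y_train, y_test = train_test_split(
    encoder_states, true_coords, test_size=0.2, random_state=0
)

# Fit a ridge-regularized linear probe from activations to coordinates
# and evaluate it on held-out charts.
probe = Ridge(alpha=1.0).fit(X_train, y_train)
r2 = probe.score(X_test, y_test)
print(f"Held-out R^2 of the linear probe: {r2:.3f}")
```

High held-out decoding accuracy on charts where the model's final answer is wrong would indicate, as in the abstract, that the coordinate information is present in the visual encoder and is lost downstream, at the vision-language handoff rather than in the encoder itself.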
Acknowledgements: This work was supported by NSF CAREER Award #2436199, NSF DRL #2400471, and awards from the Stanford Human-Centered AI Institute (HAI) and Stanford Accelerator for Learning.