Diagnosing Bottlenecks in Data Visualization Understanding by Vision-Language Models
Poster Presentation 36.329: Sunday, May 17, 2026, 2:45 – 6:45 pm, Banyan Breezeway
Session: Spatial Vision: Natural images, texture
Alexa Tartaglini1 (alexart@stanford.edu), Daniel Wurgaft1, Satchel Grant1, Christopher Potts1, Judith Fan1; 1Stanford University
Understanding any data visualization requires integrating multiple types of visual input (e.g., text, numerals, shapes, curves) to draw appropriate inferences. Current vision-language models (VLMs) are, in principle, promising candidates for modeling the fundamental computations that underlie data visualization understanding. However, these models still struggle on basic data visualization understanding tasks (Verma et al., 2025), and the causes of failure remain unclear. Are VLM failures attributable to limitations in how visual information in the data visualization is encoded, how information is transferred between the vision and language modules, or how information is processed within the language module? We developed FUGU, a suite of data visualization understanding tasks (e.g., extracting the positions of data points, the distances between them, and other summary statistics), to precisely characterize potential sources of difficulty in these models. We used FUGU to investigate three widely used VLMs (LLaMA-3.2, LLaVA-OneVision, InternVL3). To diagnose the sources of errors produced by these models, we used activation patching and linear decoding to trace information flow through each model component. We found that some models fail to generate the coordinates of individual data points correctly, and these initial errors often lead to erroneous final responses. When these models are provided with the correct coordinates, performance improves substantially, suggesting that the downstream mathematical reasoning steps performed in the language module are sound. Moreover, even when the model generates an incorrect response, the correct coordinates can be reliably decoded from latent representations in the visual encoder, suggesting that the source of these errors lies not in limitations within the visual encoder, but in the vision-language handoff. Together, these findings point to some of the opportunities and challenges presented by VLMs as a basis for developing input-computable models of data visualization understanding.
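To illustrate the linear decoding analysis mentioned above, the following is a minimal sketch (not the authors' code) of fitting a linear probe from a vision encoder's latent representations to the ground-truth coordinates of plotted data points. All variable names, shapes, and the use of synthetic stand-in activations are illustrative assumptions; in practice the activations would come from the VLM's visual encoder on FUGU stimuli.

```python
# Hypothetical linear-probe sketch: can (x, y) coordinates of a data point be
# read out linearly from the vision encoder's representation of a chart?
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in data: one pooled encoder embedding of dimension d per chart,
# plus the normalized (x, y) coordinates of a target data point in that chart.
n_charts, d = 500, 1024
encoder_states = rng.normal(size=(n_charts, d))          # placeholder for real activations
true_coords = rng.uniform(0.0, 1.0, size=(n_charts, 2))  # placeholder ground truth

X_train, X_test, y_train, y_test = train_test_split(
    encoder_states, true_coords, test_size=0.2, random_state=0
)

# Fit a ridge-regularized linear probe from activations to coordinates
# and evaluate it on held-out charts.
probe = Ridge(alpha=1.0).fit(X_train, y_train)
r2 = probe.score(X_test, y_test)
print(f"Held-out R^2 of the linear probe: {r2:.3f}")
```

High held-out decoding accuracy on charts where the model's final answer is wrong would indicate, as in the abstract, that the coordinate information is present in the visual encoder and is lost downstream, at the vision-language handoff rather than in the encoder itself.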
Acknowledgements: This work was supported by NSF CAREER Award #2436199, NSF DRL #2400471, and awards from the Stanford Human-Centered AI Institute (HAI) and Stanford Accelerator for Learning.