Object and Scene Recognition Abilities Predict the Content and Quality of Image Descriptions

Poster Presentation 56.424: Tuesday, May 19, 2026, 2:45 – 6:45 pm, Pavilion
Session: Object Recognition: Models

Melina O. Mueller1, Ian G. Dobbins2, Isabel Gauthier1; 1Vanderbilt University, 2Washington University in St. Louis

Do non-verbal visual abilities predict how people describe what they see? We measured novel object recognition (o; simultaneous matching to silhouettes, sequential matching, and oddball tasks), scene recognition (sequential matching and oddball tasks with prepared-food and outdoor images), and intelligence (g) in 101 adults, then asked them to describe 10 complex images (e.g., a market, a coffee shop, an abstract sculpture), either from memory or during perception. We examined the quality of descriptions via semantic similarity to a large language model’s descriptions of the same images, and the content of descriptions via feature extraction (e.g., feature descriptors, hedging, spatial language). With hierarchical regression, we controlled for age, gender, vocabulary, description length, and g (and for o when scene recognition was the predictor). People who were better at recognizing scenes provided higher-quality descriptions of images both while viewing them (ΔR² = .093, p = .003) and from memory (ΔR² = .125, p < .001). Visual abilities also predicted how people described images. Those with higher o made more references to shape (ΔR² = .033, p = .048) and spatial relations (ΔR² = .040, p = .014), and used less hedging (ΔR² = .052, p = .016), in memory-based descriptions. Those with higher scene recognition made more references to texture (ΔR² = .062, p = .022) and to spatial relations (ΔR² = .035, p = .046) in memory-based descriptions, and used more adjectives in perception-based descriptions (ΔR² = .062, p = .012). Beyond intelligence and vocabulary, non-verbal visual abilities predict both how we write about images and the quality of these descriptions. One interpretation is that o supports the encoding of richer visual representations that can be accessed during linguistic output. These results have theoretical implications for the study of both vision and language, and practical implications for what we can learn about people using natural language processing.
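The abstract does not specify how semantic similarity to the LLM descriptions was computed. A minimal sketch of one common approach, assuming cosine similarity over sentence embeddings (the sentence-transformers library and the model name below are illustrative choices, not the authors' method):

```python
# Hypothetical sketch: score a participant's description by its semantic
# similarity to an LLM-generated reference description of the same image.
# The embedding model and cosine-similarity metric are assumptions; the
# abstract does not name the actual scoring pipeline.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

def quality_score(participant_text: str, llm_reference: str) -> float:
    """Cosine similarity between embeddings of the two descriptions."""
    emb = model.encode([participant_text, llm_reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

print(quality_score(
    "A crowded market with fruit stalls and shoppers.",
    "An outdoor market scene where vendors sell fruit to a crowd of shoppers.",
))
```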
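The reported ΔR² values come from hierarchical regression: a covariates-only model is fit first, the visual-ability predictor is added, and the change in R² is tested. A minimal sketch of that logic, assuming OLS via statsmodels and illustrative column names (quality, scene_ability, etc.) rather than the authors' actual variables:

```python
# Hypothetical sketch of the hierarchical regression logic: fit a baseline
# model with covariates only, add the visual-ability predictor, and take
# the change in R-squared, tested with a nested-model F test.
import pandas as pd
import statsmodels.formula.api as smf

def delta_r2(df: pd.DataFrame):
    # Step 1: covariates only (age, gender, vocabulary, length, g).
    base = smf.ols(
        "quality ~ age + gender + vocabulary + length + g", data=df
    ).fit()
    # Step 2: add the visual-ability predictor of interest.
    full = smf.ols(
        "quality ~ age + gender + vocabulary + length + g + scene_ability",
        data=df,
    ).fit()
    f_stat, p_value, df_diff = full.compare_f_test(base)  # nested-model test
    return full.rsquared - base.rsquared, p_value
```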

Acknowledgements: This work was supported by the David K. Wilson Chair Research Fund and NSF BCS Award 2316474.