Humans vs. AI: Assessing the similarity of task-specific scene descriptions from humans and generative AI
Poster Presentation 26.462: Saturday, May 16, 2026, 2:45 – 6:45 pm, Pavilion
Session: Theory
Skylar Stadhard¹, Gillian Rosenberg¹, Bruce Hansen², Michelle Greene¹; ¹Barnard College, ²Colgate University
Multimodal generative AI can provide rich scene descriptions, but we have not yet established how similar AI-generated descriptions are to those written by humans. To what extent can generative AI learn a broad array of scene semantics from statistical distributions of text and images? Are some tasks more difficult to mimic? We created a broad battery of 15 tasks, divided into five groups: general knowledge, affordances, sensory experiences, affective experiences, and mental simulation. Human participants and 15 multimodal AI programs provided descriptions for 40 images on each task. We embedded text outputs using three sentence embedders and computed cosine distances between human and AI embedding vectors, so that smaller distances indicate more human-like descriptions. Additionally, we assessed descriptions with traditional NLP metrics. Human descriptions were shorter than their AI-generated counterparts (14.4 versus 28.3 words, p<0.0001), but richer and more variable: they had higher lexical entropy (3.6 versus 3.4 bits, p<0.0001) and a higher type-to-token ratio (0.89 versus 0.84, p<0.0001), indicating that humans used greater word variety. Cosine distances varied across models and tasks. Open-weight AI systems performed worse than closed-weight systems (mean cosine distances: 0.28 versus 0.21, p<0.0001). Critically, we observed significant differences across tasks, with general knowledge about scene and object categories being easiest for AI (mean: 0.22) and assessments of scene affordances being hardest (mean: 0.29). This was somewhat surprising, given human sensitivity to scene affordances (Greene et al., 2016). One possibility is that humans produce compressed affordance descriptions due to their sensitivity, and AI consequently learns less about them. Contrary to commonly held stereotypes of AI, affective tasks were comparatively easy for generative AI systems (mean: 0.23). AI can increasingly approximate human descriptions for many tasks, even tasks that require inferences that transcend mere image content, such as predicting future scene appearance.
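As an illustration of the measures named above, the following is a minimal Python sketch of cosine distance, lexical entropy, and type-to-token ratio. It is not the authors' pipeline: the abstract does not name the sentence embedders, the tokenizer, or whether entropy was computed per description or over pooled text, so the whitespace tokenizer, the per-description computation, and all function names here are assumptions, and real embedding vectors would come from a sentence embedder rather than being typed by hand.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokens with surrounding punctuation stripped.
    tokens = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [t for t in tokens if t]


def lexical_entropy(text):
    # Shannon entropy (in bits) of the word-frequency distribution of one description.
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def type_token_ratio(text):
    # Number of unique words divided by the total number of words.
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def cosine_distance(u, v):
    # 1 minus cosine similarity: 0 for parallel vectors, 1 for orthogonal ones.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)
```

Under these assumptions, a description that repeats words scores lower on both lexical measures than one of the same length with no repeats, and a human-AI pair whose embeddings point in similar directions yields a cosine distance near zero.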
Acknowledgements: NSF 2522311