Large language model performance in ophthalmic patient education: Amblyopia and age-related macular degeneration
Poster Presentation 23.476: Saturday, May 16, 2026, 8:30 am – 12:30 pm, Pavilion
Session: Action: Miscellaneous
Sunwoo Kwon1,2,3, Artashes Yeritsyan2, Dennis Levi2,3,4; 1Exponent, 2Herbert Wertheim School of Optometry and Vision Science, 3Center for Innovation in Vision and Optics, 4Helen Wills Neuroscience Institute
Artificial intelligence (AI) chatbots are increasingly used for patient education, with 22% of Americans reporting having sought health advice from these tools. However, the reliability of chatbot-generated ophthalmic information remains unclear. We conducted a multi-condition evaluation of chatbot responses to patient-focused questions across two prevalent disorders of the eye and visual system: amblyopia and age-related macular degeneration (AMD). We assessed the accuracy, comprehensiveness, and readability of these responses against patient materials from the American Academy of Ophthalmology (AAO) and the American Optometric Association (AOA). Two condition-specific question sets were constructed from AAO/AOA patient brochures: 21 amblyopia questions and 12 AMD questions. Each question was entered twice into six publicly available AI chatbots (ChatGPT-3.5, ChatGPT-4, Gemini, Meta AI, Snap AI, and Copilot), yielding 252 amblyopia and 144 AMD responses. Using a 5-point Likert scale, amblyopia responses were rated by three optometrists with expertise in amblyopia, and AMD responses by five optometrists with expertise in retinal disease. Accuracy and comprehensiveness were analyzed with the Friedman test and post hoc Wilcoxon signed-rank tests, while readability was analyzed with ANOVA and post hoc Tukey HSD tests. Because the question sets and raters differed, the two datasets were analyzed independently and synthesized qualitatively. Across both conditions, GPT-3.5, GPT-4, Copilot, and Gemini consistently produced more accurate and comprehensive responses than Meta AI, Snap AI, and the AAO/AOA brochures. With the exception of Copilot, all chatbots produced significantly harder-to-read text than the AAO/AOA patient brochures for both conditions. Our results demonstrate that across two ophthalmic conditions, AI chatbots, particularly GPT-3.5, GPT-4, Copilot, and Gemini, outperformed AAO/AOA materials in accuracy and comprehensiveness but displayed persistent readability challenges. Collectively, these findings delineate both the emerging capabilities and current limitations of AI systems in ophthalmic patient education.
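The analysis described above maps onto standard library routines. Below is a minimal sketch in Python using scipy and statsmodels; the rating matrix, readability scores, and the Bonferroni correction on the pairwise Wilcoxon tests are hypothetical placeholders standing in for the study's actual data and exact procedure, which the abstract does not fully specify.

```python
# Sketch of the abstract's statistical pipeline: Friedman + post hoc Wilcoxon
# signed-rank tests for Likert ratings, ANOVA + Tukey HSD for readability.
# All data below are synthetic placeholders, not the study's measurements.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
chatbots = ["GPT-3.5", "GPT-4", "Gemini", "Meta AI", "Snap AI", "Copilot"]

# Accuracy ratings: rows = 21 amblyopia questions, columns = chatbots,
# values = rater-averaged 1-5 Likert scores (hypothetical).
ratings = rng.integers(1, 6, size=(21, len(chatbots)))

# Friedman test: nonparametric repeated-measures comparison across chatbots.
chi2, p = stats.friedmanchisquare(*(ratings[:, j] for j in range(len(chatbots))))
print(f"Friedman chi2 = {chi2:.2f}, p = {p:.4f}")

# Post hoc pairwise Wilcoxon signed-rank tests. A Bonferroni correction is a
# common follow-up; the abstract does not state which correction was used.
n_pairs = len(chatbots) * (len(chatbots) - 1) // 2
for i in range(len(chatbots)):
    for j in range(i + 1, len(chatbots)):
        _, p_ij = stats.wilcoxon(ratings[:, i], ratings[:, j])
        print(f"{chatbots[i]} vs {chatbots[j]}: p = {min(p_ij * n_pairs, 1.0):.4f}")

# Readability: one score per response (e.g., a grade-level metric; the
# abstract does not name the formula), compared with ANOVA and Tukey HSD.
scores = rng.normal(11, 2, size=60)        # hypothetical readability scores
groups = np.repeat(chatbots, 10)           # 10 responses per chatbot
f, p_anova = stats.f_oneway(*(scores[groups == c] for c in chatbots))
print(f"ANOVA F = {f:.2f}, p = {p_anova:.4f}")
print(pairwise_tukeyhsd(scores, groups))   # pairwise Tukey HSD summary table
```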