A Population Coding Approach to Estimating Inter-Observer Consistency in Eye Fixations

Poster Presentation 36.416: Sunday, May 17, 2026, 2:45 – 6:45 pm, Pavilion
Session: Eye Movements: Models, remapping

Matthias Kümmerer1, Matthias Bethge1; 1Tübingen AI Center, University of Tübingen

How consistently do people look at the same things in a given image? The answer matters fundamentally for understanding visual attention, but standard methods systematically underestimate human consistency. We propose a population coding approach: rather than estimating fixation distributions from spatial data alone, we combine information across multiple sources (spatial, semantic, and contextual) to better capture the true structure of where observers look. The problem arises from data limitations: typical eye-tracking studies collect 10–20 fixations per image. Standard kernel density estimation worked well on such data until recently, but the newest generation of image-computable saliency models now outperforms these estimates, sometimes substantially. When a model beats the empirical gold standard, the gold standard itself must be underestimating inter-observer consistency, which makes it unreliable for identifying images where computational models fail to predict human behavior. Our approach combines three innovations. First, we employ adaptive-bandwidth kernel density estimation (Abramson's method), which automatically adjusts spatial precision to capture both tight fixation clusters and broader distributions. Second, we incorporate semantic scene information through a mixture model, exploiting the fact that similar image features attract fixations across different images. Third, we optimize the estimation parameters on a per-image basis. Results across multiple free-viewing datasets (MIT1003, CAT2000, COCO-Freeview) show substantially improved inter-observer consistency estimates: 10% better on average and more than 50% better on individual images. This demonstrates that saliency models still have room to improve, and it enables us to reliably identify where current models fail to capture human viewing behavior. Such cases provide the clearest targets for investigating missing mechanisms in computational models of attention. Finally, we demonstrate extensibility by incorporating task context (free viewing vs. visual search) as an additional mixture component. In principle, task-specific fixation estimates should be more precise than task-averaged ones, but conditioning on task further reduces the available data per image. By combining task-specific and task-general KDE components, we retain the precision of task-specific information while benefiting from the statistical power of pooling across tasks.
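
To make the adaptive-bandwidth step concrete, here is a minimal sketch of Abramson's method, assuming a Gaussian kernel and the standard sensitivity exponent of 1/2; the function and parameter names are illustrative and not the authors' implementation.

```python
# A minimal sketch of adaptive-bandwidth (Abramson) kernel density
# estimation over fixation locations. The Gaussian kernel and the
# default pilot bandwidth are illustrative assumptions.
import numpy as np


def abramson_kde(fixations, grid, pilot_bandwidth=35.0, alpha=0.5):
    """Evaluate an Abramson adaptive-bandwidth KDE on a grid.

    fixations: (n, 2) array of fixation coordinates in pixels.
    grid: (m, 2) array of evaluation points.
    pilot_bandwidth: fixed bandwidth (pixels) of the pilot estimate.
    alpha: sensitivity exponent; 0.5 is Abramson's choice.
    """
    # Pilot density at the fixation locations (fixed bandwidth).
    d2 = ((fixations[:, None, :] - fixations[None, :, :]) ** 2).sum(-1)
    pilot = np.exp(-0.5 * d2 / pilot_bandwidth**2).mean(axis=1)
    pilot /= 2 * np.pi * pilot_bandwidth**2

    # Local bandwidth factors, normalized by the geometric mean:
    # narrow kernels in dense clusters, wide kernels in sparse regions.
    g = np.exp(np.log(pilot).mean())
    bw = pilot_bandwidth * (pilot / g) ** (-alpha)

    # Final density: mixture of Gaussians with per-fixation bandwidths.
    d2_eval = ((grid[:, None, :] - fixations[None, :, :]) ** 2).sum(-1)
    kernels = np.exp(-0.5 * d2_eval / bw**2) / (2 * np.pi * bw**2)
    return kernels.mean(axis=1)
```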
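As a sketch of how the mixture model and the per-image optimization could fit together, the density below is a convex combination of an image-specific KDE, a cross-image prior standing in for the semantic component, and a uniform regularizer. Fitting the weights by maximizing the log-likelihood of held-out fixations is an assumption made here for illustration; the abstract does not state the objective.

```python
# Hypothetical per-image mixture: w0 * KDE + w1 * cross-image prior
# + w2 * uniform, with weights fit on the probability simplex.
import numpy as np
from scipy.optimize import minimize


def mixture_log_likelihood(weights, p_kde, p_prior, p_uniform):
    """Mean log-likelihood of held-out fixations under the mixture.

    p_kde, p_prior, p_uniform: densities of the held-out fixations
    under each component, all positive arrays of shape (n,).
    """
    w = np.asarray(weights)
    mix = w[0] * p_kde + w[1] * p_prior + w[2] * p_uniform
    return np.log(mix).mean()


def fit_mixture_weights(p_kde, p_prior, p_uniform):
    """Maximize held-out log-likelihood over the weight simplex."""
    def objective(z):
        # Softmax keeps the weights positive and summing to one.
        w = np.exp(z) / np.exp(z).sum()
        return -mixture_log_likelihood(w, p_kde, p_prior, p_uniform)

    result = minimize(objective, x0=np.zeros(3), method="Nelder-Mead")
    return np.exp(result.x) / np.exp(result.x).sum()
```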
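Finally, a sketch of the task extension, assuming a single per-image interpolation weight between the task-specific and task-general (pooled) KDE components; the helper names and the grid search over candidate weights are illustrative simplifications.

```python
# Hypothetical blend of a task-specific KDE (precise but data-poor)
# with a task-general KDE pooled across tasks (coarse but data-rich).
import numpy as np


def task_mixture_density(p_task, p_pooled, w_task):
    """Convex combination of task-specific and task-general densities."""
    return w_task * p_task + (1.0 - w_task) * p_pooled


def select_weight(p_task_heldout, p_pooled_heldout,
                  candidates=np.linspace(0.0, 1.0, 21)):
    """Pick the weight maximizing held-out log-likelihood per image."""
    scores = [np.log(task_mixture_density(p_task_heldout,
                                          p_pooled_heldout, w)).sum()
              for w in candidates]
    return candidates[int(np.argmax(scores))]
```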

Acknowledgements: This work was supported by the German Research Foundation (DFG): SFB 1233, Robust Vision: Inference Principles and Neural Mechanisms, project number: 276693517.