Perceptual Organization: Segmentation and grouping

Talk Session: Saturday, May 16, 2026, 8:15 – 9:45 am, Talk Room 2
Moderator: Jeremy Wilmer, Wellesley College

Talk 1, 8:15 am, 21.21

Ensemble feature distributions support visual stability

Vladislav Khvostov1, Julie Golomb1; 1Department of Psychology, The Ohio State University

To bypass attentional/working memory limitations, the visual system can represent statistical information about object groups (ensembles). Recent work revealed that observers explicitly represent feature distributions, i.e., relative frequencies of each feature value in the visual field. However, these feature distribution representations are abstract: by pooling features within the entire visual field, the visual system loses information about individual objects’ locations. Might this ability help explain the visual stability paradox (people’s perception of the world as stable despite drastic changes in the retinal positions of visible objects due to eye/head movements)? To test this, we had observers perform a transsaccadic change detection task. Observers viewed a display containing 36 colored disks (500ms), then executed a saccade, during which the display changed. Observers reported the number of disks that changed color between the two displays. In the Distribution Change condition, the color changes altered the shape of the color distribution (along the Gaussian-Bimodal spectrum, identical mean/range). In the Spatial Swap condition, we preserved the exact ensemble color distribution but swapped the spatial positions of individual colors. These conditions were matched for the number of disk changes and the total item-wise color difference. In the Spatial Swap condition, observers reported very few changes, even in the most extreme case, where 78% of the disks changed color, indicating that as long as the ensemble feature distribution is preserved across two scenes, participants perceive the environment as stable. In contrast, observers were highly sensitive to transsaccadic changes in the Distribution Change condition. 
We conclude that ensemble feature distribution representations play an important role in visual stability: because people make frequent eye/head movements, it may be more beneficial for the visual system to represent ensembles as feature distributions and not store individual members’ locations. As a result, perception of visual stability relies on feature distribution representations.
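The logic of the two conditions can be sketched in a few lines. This is an illustrative simulation, not the authors' stimulus code: the hue values, Gaussian parameters, and swap count below are arbitrary assumptions, chosen only so that 28 of 36 disks (78%, the most extreme case reported) change while the ensemble distribution is preserved.

```python
import random

random.seed(0)

def spatial_swap(colors, n_swaps):
    """Swap the positions of pairs of disks. Individual items change,
    but the multiset of colors (the feature distribution) is untouched."""
    colors = list(colors)
    idx = list(range(len(colors)))
    random.shuffle(idx)                      # pick distinct disks to swap
    for i in range(0, 2 * n_swaps, 2):
        a, b = idx[i], idx[i + 1]
        colors[a], colors[b] = colors[b], colors[a]
    return colors

# 36 disks with hues drawn from a roughly Gaussian distribution (toy units)
base = [random.gauss(180, 20) for _ in range(36)]

# Spatial Swap condition: 14 swaps -> 28 of 36 disks (78%) change color,
# yet the ensemble color distribution is identical across the two displays.
swapped = spatial_swap(base, n_swaps=14)

assert sorted(swapped) == sorted(base)       # distribution preserved
changed = sum(1 for a, b in zip(base, swapped) if a != b)
print(changed)                               # up to 28 item-wise changes
```

The sketch makes the key manipulation concrete: item-wise change can be extreme while the summary statistic the visual system appears to track remains constant.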

Supported by grant NIH R01-EY025648 (JG)

Talk 2, 8:30 am, 21.22

Principles of Local and Global Grouping that Underlie Segmentation of Natural Texture Images

Wilson Geisler1, Abhranil Das1,2; 1University of Texas at Austin, 2Brain Institute UFRN

Humans have a remarkable ability to segment natural scenes into physically meaningful regions. This sophisticated process contains many components, some “high level” (exploiting recognition of materials, objects and/or scene context), and others “low level” and largely independent of specific prior experience. The low-level components are essential in that they are required for initiating recognition processes, and for learning to recognize new materials, objects, and contexts. Our aim is to develop hierarchical Bayesian observer (HBO) models of natural texture segmentation that are biologically plausible, account for the statistical properties of natural scenes, and do not depend on prior experience. The current HBO model consists of five grouping steps consistent with Gestalt principles: 1. local similarity grouping with local normalization, 2. mutual similarity grouping (local grouping is strengthened if the neighboring regions are similar to the same set of other regions), 3. transitive grouping (good continuation), 4. confidence grouping (neighboring regions far from the same-different decision boundary guide grouping of regions near the boundary), and 5. region grouping (similarity grouping of the regions from the initial segmentation). We find that the local similarity grouping process, trained to maximize accuracy based on natural scene statistics and Bayesian decision processes, predicts human local similarity grouping accuracy. We then find that the four additional steps are able to accurately segment images with randomly shaped regions containing arbitrary natural textures. The success of the model depends on all of the steps, but especially on local-similarity grouping and transitive grouping. We also find that the transitive grouping allows correct segmentation of non-stationary (e.g., slanted in depth) texture regions. 
Further, we find that when illumination varies across the image, local normalization enables both correct texture segmentation and estimation of the illumination change. Finally, we show that state-of-the-art deep-network models fail on these stimuli where our model succeeds.
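The role of local normalization in step 1 can be illustrated with a minimal sketch. The feature vectors, distance measure, and decision criterion below are stand-in assumptions, not the authors' Bayesian decision rule trained on natural-scene statistics; the point is only that removing local mean and contrast lets two patches of the same texture group together despite an illumination change.

```python
import math

def normalize(patch):
    """Local normalization: subtract the local mean and divide by local
    contrast, so similarity reflects texture structure, not illumination."""
    mean = sum(patch) / len(patch)
    centered = [v - mean for v in patch]
    contrast = math.sqrt(sum(v * v for v in centered) / len(patch)) or 1.0
    return [v / contrast for v in centered]

def same_texture(p1, p2, criterion=0.5):
    """Toy local similarity grouping: group two neighboring patches when
    their normalized feature vectors are close (a placeholder for the
    model's same/different decision boundary)."""
    a, b = normalize(p1), normalize(p2)
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))
    return d < criterion

# Same texture under different illumination: a scaled and shifted copy
# becomes identical after normalization, so the patches group correctly.
patch = [0.2, 0.8, 0.3, 0.9, 0.1, 0.7]
brighter = [2.0 * v + 0.5 for v in patch]

assert same_texture(patch, brighter)                      # grouped
assert not same_texture(patch, [0.9, 0.1, 0.8, 0.2, 0.9, 0.1])  # split
```

This is why, as the abstract notes, local normalization supports both correct segmentation and estimation of the illumination change: the multiplicative/additive factors removed by normalization are exactly the illumination parameters.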

Talk 3, 8:45 am, 21.23

Representations of Single Objects, Homogeneous Ensembles, and Heterogeneous Ensembles

Patxi Elosegi1, Boyang Hu1, Marvin Chun1, Yaoda Xu1; 1Yale University

Visual scenes are inherently hierarchical, containing discrete objects (e.g., a bird) and collections of objects that can be efficiently summarized using ensemble statistics (e.g., the average motion of a flock of birds). Although prior psychophysical work shows that objects and ensembles can be perceived in parallel, it remains unclear how they are represented together in the brain, especially in comparison to heterogeneous ensembles with mixed categories of objects. In Experiment 1, we used a block-design fMRI paradigm and presented participants with either a single object from one of three categories or an 8-item homogeneous ensemble comprising objects from one of three categories. Participants performed an orthogonal image-jitter-detection task while viewing the stimuli. Using multivoxel-pattern analysis, we examined responses from functionally defined early visual areas, object-selective areas, scene-selective areas, and posterior parietal cortex. Across all ROIs, decoding of single object categories generalized strongly to ensemble categories and vice versa, suggesting shared representations. Strikingly, adding a single outlier to an otherwise homogeneous ensemble significantly altered the representation compared to those of single objects and homogeneous ensembles, as revealed by limited cross-decoding from the latter two to the former. To further characterize the neural representation of heterogeneous ensembles, in Experiment 2, we presented stimuli containing nine ratios of objects from two categories, including homogeneous ensembles (8:0 and 0:8) and all heterogeneous combinations (7:1, 6:2, 5:3, 4:4, 3:5, 2:6, and 1:7). Across all ROIs, we replicated the main findings of Experiment 1. Critically, a U-shaped representational structure emerged for the heterogeneous ensembles, indicative of an orthogonal encoding of object category and ratio. 
Together, these findings show that single objects and homogeneous ensembles share neural representations, whereas heterogeneous ensembles utilize a coding scheme that preserves information about both the constituent object categories and their relative proportions.

Supported by NIH Grant R01EY030854 to YX.

Talk 4, 9:00 am, 21.24

From representation to reconstruction: temporal signatures of generative assembly

Yaxin Liu1, Yuval Hart2, Adam Green1; 1Georgetown University, 2Hebrew University of Jerusalem

Humans rapidly reconstruct object parts and relations from mere silhouettes. Yet this "inverse graphics" problem is computationally intractable given the combinatorial explosion of part configurations. How do humans efficiently navigate this vast search space in real time? Prior work has mainly examined how humans and models infer a single global shape or category, leaving open how multi-step generative reconstruction occurs. We hypothesized that efficient generation relies on detecting high-constraint parts (e.g., sharp vertices with unique geometry) that act as "anchor points" to reduce the search space. By contrast, low-constraint and ambiguous parts may demand internal mental simulation or offload computation to continuous sampling. To test this, 90 participants reconstructed persons, animals, objects, and geometric shapes from silhouettes using a computerized Tangram task—a generative assembly task. Participants used translations, rotations, and flips to match the target contours. We recorded high-resolution mouse trajectories to analyze the fine-grained temporal profile of real-time reconstruction. For each target contour, we computed a constraint index capturing the number of valid part-to-contour mappings (high constraint: anchors with near 1:1 mappings; low constraint: ambiguous regions with many valid decompositions). We found that high-constraint regions triggered short-latency, ballistic bursts of actions clustered in time. Conversely, low-constraint regions elicited threefold longer pauses and non-clustered moves, consistent with a shift to slower mental simulation over multiple competing assemblies. These findings suggest that generative reconstruction may be neither purely feedforward nor purely analysis-by-synthesis. Rather, high-constraint anchors are used to rapidly prune the hypothesis space, allowing slower, simulation-based mechanisms to resolve ambiguities without the need for continuous physical sampling.
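The constraint index can be sketched as a simple counting operation. The shape "signatures" below (toy vertex-angle tuples for Tangram-like pieces) are hypothetical stand-ins for the authors' actual geometric matching; the sketch shows only the counting logic: a region matched by exactly one part is a high-constraint anchor, while a region matched by several parts is low constraint.

```python
def valid_mappings(region, parts):
    """Count candidate parts whose toy shape signature (a tuple of vertex
    angles, order-insensitive) could fill the given contour region."""
    return sum(1 for p in parts if sorted(p) == sorted(region))

def constraint_index(region, parts):
    """1.0 = a unique 1:1 mapping (high-constraint anchor);
    values near 0 = many valid decompositions (ambiguous region)."""
    n = valid_mappings(region, parts)
    return 1.0 / n if n else 0.0

# Toy Tangram-like part set: two identical right triangles, one square,
# one parallelogram (angle signatures in degrees, hypothetical)
parts = [(45, 45, 90), (45, 45, 90), (90, 90, 90, 90), (45, 135, 45, 135)]

print(constraint_index((90, 90, 90, 90), parts))  # only the square fits: 1.0
print(constraint_index((45, 90, 45), parts))      # two triangles fit: 0.5
```

Under this toy definition, the square-shaped region is an anchor that admits an immediate, ballistic placement, while the triangular region leaves two competing hypotheses to be resolved by slower simulation.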

Talk 5, 9:15 am, 21.25

When two become one: A “same-object advantage” that spans across shadows and their casting objects

Albert Z. K. Li1, Dominic Alford-Duguid1, Joan Danielle K. Ongchoco1; 1University of British Columbia

Visual experience is populated by “things” (solid objects) and “stuff” (liquids or granular substances). But then there are shadows. They are informative, but immediately discounted. They can carry the structure of their objects, but are deformable and often distorted. Perhaps most puzzling: shadows may be segmented as bounded regions that give rise to object-based effects, yet ultimately, cannot exist in the physical world without their casting objects. What is the relationship between shadows and their objects in visual processing? To explore this, we used a classic test of object-based representations: the same-object advantage. Observers viewed 3D-rendered scenes with a light, a casting object, and a receiving surface. Two probes flashed quickly, either with the same or different orientations, and observers reported whether they were identical. Observers were more accurate when probes appeared on the same object (versus when split between the object and surface) and on the same shadow (versus when split between the shadow and surface), replicating the traditional same-object advantage. But curiously, the usual cost of switching between object representations was not present when probes were split between the shadow and its object — a “shadow-object advantage” — which should not occur if shadows and objects are being treated as “equals” (i.e., separate bounded regions). This shadow-object advantage held even when the shadow was incongruent with its object (perhaps because once tagged as the object’s shadow, its exact shape is discounted), and remarkably, even when it was physically separated from its object (by being projected onto a wall). And when we “break” the shadow (by replacing it with an outlined dark mask of the identical shape and position), this shadow-object advantage then disappears. Altogether, these results suggest a new visual category: shadows as *derivative individuals* may be bound together in visual processing with their casting objects.

Talk 6, 9:30 am, 21.26

Graphs of averages exaggerate and sow disagreement

Jeremy Wilmer1, Sarah Kerns1,2, Lily Widdup3, Ken Nakayama3,4; 1Wellesley College, 2Dartmouth College, 3Harvard University, 4UC Berkeley

Scientific communication must be both valid (accurate) and reliable (consistent). Graphs are among the most widely used tools for conveying scientific evidence, yet we show that a very common practice—displaying only group averages—systematically fails on both counts. Using a novel drawing-based response measure, participants sketched the data points they believed underlay plotted averages. Across 40 replications spanning plot types, data domains, and levels of quantitative expertise, interpretations of average-only graphs were both severely exaggerated and strikingly inconsistent. Perceived effect sizes frequently exceeded plausible scientific bounds, and disagreement among viewers interpreting the same single graph was greater than that typically observed across entire research literatures. In most tested cases, interpretations were systematically less accurate than blind guesses, indicating that graphs of averages can satisfy the formal criteria for misinformation. We identify a simple and effective remedy: displaying individual data points alongside the mean. Contrary to concerns that increased visual detail would overwhelm or confuse viewers, data-showing graphs produced substantially greater accuracy, stronger consensus among viewers, and more positive subjective evaluations. Together, these results demonstrate that a widely used visualization practice can actively distort scientific understanding and that routinely showing raw data offers a practical, evidence-based improvement to the validity and reliability of scientific communication.
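Why average-only graphs invite exaggeration can be illustrated with simulated data. The group means, standard deviations, and sample sizes below are arbitrary assumptions (a moderate effect, Cohen's d near 0.5), not values from the study; the point is that a bar graph of two means hides how heavily the raw distributions overlap.

```python
import random
import statistics

random.seed(1)

# Two hypothetical groups with a moderate true mean difference
a = [random.gauss(0.0, 1.0) for _ in range(200)]
b = [random.gauss(0.5, 1.0) for _ in range(200)]

# Standardized effect size (toy pooled-SD version)
d = (statistics.mean(b) - statistics.mean(a)) / statistics.pstdev(a + b)

# A mean-only graph shows just two bars; the raw points would reveal that
# a large fraction of group A scores above group B's mean.
overlap = sum(1 for x in a if x > statistics.mean(b)) / len(a)

print(round(d, 2), round(overlap, 2))
```

Plotting the 400 individual points alongside the two means, as the authors recommend, makes this overlap visible at a glance, which is exactly the information a viewer needs to avoid reading a modest difference as a categorical one.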