Efficient encoding of dynamic visual scenes based on elementary 3D features

Poster Presentation 43.312: Monday, May 18, 2026, 8:30 am – 12:30 pm, Banyan Breezeway
Session: Object Recognition: Features, parts

Serena Castellotti1,2, Antonio Brau1, Maria Michela Del Viva2, Giovanni Punzi1,3; 1Department of Physics, University of Pisa, Italy, 2Department of Neurosciences, Psychology, Drug Research and Child Health (NEUROFARBA), University of Florence, Florence, Italy, 3INFN, Pisa, Italy

Previous work has suggested that, given the need to rapidly process large amounts of visual data, the visual system extracts bottom-up saliency maps of static scenes using a limited set of features. These edge- and bar-like “optimal” 2D features can be derived by applying constrained maximum-entropy principles to the frequency distribution of all possible features. It is interesting to ask the same question in the spatio-temporal domain, as motion information constitutes a crucial saliency cue in early visual analysis. However, extending the approach to real-life dynamic scenes leads to an exponential increase in the number of possible 3D features, making it too demanding to be performed with standard computing means. We address this challenge by using specialized big-data reduction algorithms (Floating Top-k) adapted for execution on FPGA devices. This new approach makes it computationally feasible to identify optimal 3D features by applying the constrained maximum-entropy method to a much larger space of potential features than previously possible. We created movie sketches using optimal 3D features (3×3 pixels × 3 frames) extracted from a large video database and tested their effectiveness in a discrimination task. Observers’ accuracy remained very high across a broad range of model parameters, supporting the robustness of the model in predicting salient motion features. As a control, we generated alternative sketches by selecting sets of 3D features from frequency ranges different from the constrained maximum-entropy selection, matching either its information content or its constraint values. In both cases, these alternative “non-optimal” selections yielded poorer discrimination performance, indicating that neither total information nor the constraints alone are sufficient to produce meaningful representations of moving scenes.
Overall, these findings suggest that the visual system reduces complex dynamic inputs at an early processing stage by selecting a limited number of elementary features that maximize information transmission under computational constraints.
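The selection principle described above can be illustrated with a minimal sketch: given the empirical frequency of each candidate feature, keep the subset that maximizes total transmitted entropy subject to a limit on the number of stored features (memory) and on the total occurrence rate of the selected set (bandwidth). The greedy strategy, the constraint values `n_max` and `w_max`, and the toy frequency distribution below are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def select_features(p, n_max, w_max):
    """Greedy constrained maximum-entropy selection (illustrative).

    p     : empirical frequencies of all candidate features (sums to 1)
    n_max : maximum number of distinct features kept (memory constraint)
    w_max : maximum total frequency of the selected set (bandwidth constraint)
    """
    h = -p * np.log2(p)                 # per-feature entropy contribution
    selected, total_p = [], 0.0
    for i in np.argsort(-h):            # most informative features first
        if len(selected) == n_max:
            break
        if total_p + p[i] <= w_max:     # respect the bandwidth budget
            selected.append(int(i))
            total_p += p[i]
    return np.array(sorted(selected))

# Toy demo: a skewed frequency distribution over 100 candidate features.
rng = np.random.default_rng(0)
p = rng.dirichlet(np.full(100, 0.1))
keep = select_features(p, n_max=20, w_max=0.2)
```

Because the per-feature entropy contribution −p log p vanishes for both very frequent and very rare features, a selection of this kind tends to favor features of intermediate frequency, consistent with the frequency-band structure of the optimal features reported in the earlier 2D work.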

Acknowledgements: This project was funded by the European Union – Next Generation EU, in the context of the grant PRIN 2022 (Project: "Real time reconstruction of data from LHC experiments with a distributed FPGA system", Grant no. 2022Z3K93E, CUP: I53D23001540006).