Sequential Spatial Transformer Networks for Salient Object Classification

Fatemeh Azimi; David Dembinsky; Federico Raue; Jörn Hees; Sebastian Palacio; Andreas Dengel
In Proceedings of the 12th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2023), February 22–24, Lisbon, Portugal. SciTePress, February 2023.


Standard classification architectures are designed and trained to achieve impressive performance on dedicated image classification datasets, which usually contain images with a single object located at the image center. However, their accuracy drops when this assumption is violated, e.g., if the target object is cluttered with background noise or is not centered. In this paper, we study salient object classification: a more realistic scenario in which multiple object instances appear in the scene and the image should be classified by the label of the most salient object. Inspired by previous work on Reinforcement Learning and Spatial Transformer Networks, we propose a model equipped with a trainable focus mechanism that improves classification accuracy. Our experiments on the PASCAL VOC dataset show that the method increases the intersection-over-union with the salient object, improving classification accuracy by 1.82 pp overall and by 3.63 pp for smaller objects. We provide an analysis of the failure cases, discussing aspects such as the effect of dataset bias and of the saliency definition on the classification output.
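The core building block referenced in the abstract, a Spatial Transformer "focus" step, can be illustrated with a short sketch. This is not the authors' model: it is a minimal, hypothetical example showing how predicted affine parameters (here hard-coded; in the paper they would come from a trainable localization network, possibly applied sequentially) crop and zoom the input before it reaches a standard classifier.

```python
import torch
import torch.nn.functional as F

def focus_transform(image, theta):
    """Warp an image batch with a predicted 2x3 affine transform.

    `theta` has shape (N, 2, 3); scale terms < 1 zoom in on a region,
    translation terms shift the sampling window. This mimics a single
    spatial-transformer focus step (sketch only, not the paper's model).
    """
    grid = F.affine_grid(theta, image.size(), align_corners=False)
    return F.grid_sample(image, grid, align_corners=False)

# Example: zoom into the image center by 2x (scale 0.5, no translation).
img = torch.arange(16.0).reshape(1, 1, 4, 4)
theta = torch.tensor([[[0.5, 0.0, 0.0],
                       [0.0, 0.5, 0.0]]])
out = focus_transform(img, theta)  # same spatial size, magnified center
```

In the sequential setting described in the paper, such a step would be applied repeatedly, with a policy (trained via Reinforcement Learning) refining the focus region at each step before classification.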