Publication

Multimodal integration of human-like attention in visual question answering

Ekta Sood; Fabian Kögel; Philipp Müller; Dominike Thomas; Mihai Bâce; Andreas Bulling

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. International Conference on Computer Vision and Pattern Recognition (CVPR-2023), Pages 2647-2657, IEEE/CVF, 2023.

Abstract

Human-like attention as a supervisory signal to guide neural attention has shown significant promise but is currently limited to unimodal integration – even for inherently multimodal tasks such as visual question answering (VQA). We present the Multimodal Human-like Attention Network (MULAN) – the first method for multimodal integration of human-like attention on image and text during training of VQA models. MULAN integrates attention predictions from two state-of-the-art text and image saliency models into neural self-attention layers of a recent transformer-based VQA model. Through evaluations on the challenging VQAv2 dataset, we show that MULAN is competitive with the state of the art in its model class – achieving 73.98% accuracy on test-std and 73.72% on test-dev with approximately 80% fewer trainable parameters than prior work. Overall, our work underlines the potential of integrating multimodal human-like attention into neural attention mechanisms for VQA.
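
The abstract does not say how the saliency predictions enter the self-attention layers. As an illustration only, the sketch below assumes one simple option: a per-token saliency prior is added as a bias to the scaled dot-product scores before the softmax. The function names, the `prior` vector, and the `weight` parameter are hypothetical and are not taken from the paper; this is not MULAN's actual formulation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_with_prior(queries, keys, values, prior, weight=1.0):
    """Scaled dot-product attention with an additive saliency bias.

    prior[j] is a hypothetical human-like attention score for token j,
    for example the output of a text or image saliency model.
    weight scales how strongly the prior shifts the attention scores
    (assumption: additive bias on pre-softmax scores).
    """
    d = len(queries[0])
    outputs, attention = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) + weight * p
            for k, p in zip(keys, prior)
        ]
        w = softmax(scores)
        attention.append(w)
        # Weighted sum of value vectors, one output channel at a time.
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return outputs, attention


# Toy input: three tokens with two-dimensional features (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prior = [0.0, 2.0, 0.0]  # the hypothetical saliency model favours token 1

_, plain = self_attention_with_prior(tokens, tokens, tokens, prior, weight=0.0)
_, guided = self_attention_with_prior(tokens, tokens, tokens, prior, weight=1.0)

for p_row, g_row in zip(plain, guided):
    print([round(x, 3) for x in p_row], "->", [round(x, 3) for x in g_row])
```

In this toy setup, every query assigns more attention to token 1 once the prior is switched on, while each attention row still sums to one. A multimodal variant would apply the same idea twice, once with a text prior over question tokens and once with an image prior over image regions.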

Further Links

https://openaccess.thecvf.com/content/CVPR2023W/GAZE/papers/Sood_Multimodal_Integration_of_Human-Like_Attention_in_Visual_Question_Answering_CVPRW_2023_paper.pdf

Sood_Multimodal_Integration_of_Human-Like_Attention_in_Visual_Question_Answering_CVPRW_2023_paper.pdf (PDF, 5 MB)