Symbol Grounding in Multimodal Sequences using Recurrent Neural Networks

Federico Raue; Wonmin Byeon; Thomas Breuel; Marcus Liwicki
In: Tarek R. Besold; Artur d'Avila Garcez; Gary F. Marcus; Risto Miikkulainen (Eds.). Workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches (CoCo-2015), located at NIPS 2015, December 11-12, Montreal, Canada, CEUR Workshop Proceedings, 2015.


The problem of how infants learn to associate visual inputs, speech, and internal symbolic representations has long been of interest in Psychology, Neuroscience, and Artificial Intelligence. A priori, both visual and auditory inputs are complex analog signals with a large amount of noise and context, lacking any segmentation information. In this paper, we address a simple form of this problem: the association of one visual input and one auditory input with each other. We show that the presented model learns segmentation, recognition, and symbolic representation under two simple assumptions: (1) that a symbolic representation exists, and (2) that the two different inputs represent the same symbolic structure. Our approach uses two Long Short-Term Memory (LSTM) networks for multimodal sequence learning and recovers the internal symbolic space using an EM-style algorithm. We compared our model against standard LSTM on three multimodal datasets: digit, letter, and word recognition. Our model reached performance similar to that of LSTM.
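The core of the EM-style grounding step is binding the output units of the two modality networks to a shared symbol space from their co-activations alone. The following is a minimal NumPy sketch of that binding under simplifying assumptions not taken from the paper: the two trained LSTMs are replaced by noiseless per-timestep winning units, each modality labels the same underlying symbols with its own private unit ordering, and the binding is recovered from a co-occurrence count matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4    # number of symbols
T = 200  # number of paired timesteps (multiple of K)

# Ground-truth symbol at each timestep, shared across modalities;
# repeat-then-shuffle guarantees every symbol occurs.
symbols = rng.permutation(np.repeat(np.arange(K), T // K))

# Hypothetical stand-in for trained LSTM outputs: each modality uses
# its own private, unknown ordering of the K output units.
perm_a = rng.permutation(K)  # unit perm_a[s] of modality A fires for symbol s
perm_b = rng.permutation(K)  # unit perm_b[s] of modality B fires for symbol s

units_a = perm_a[symbols]    # per-timestep winning unit, modality A
units_b = perm_b[symbols]    # per-timestep winning unit, modality B

# Accumulate co-occurrence counts between the two modalities' winning
# units (the statistics-gathering half of an EM-style update) ...
counts = np.zeros((K, K))
for a, b in zip(units_a, units_b):
    counts[a, b] += 1

# ... then bind each unit of A to the unit of B it co-fires with most.
binding = counts.argmax(axis=1)  # binding[a] = matched unit in modality B

# With noiseless predictions this recovers the true correspondence:
# unit perm_a[s] in A should map to unit perm_b[s] in B.
expected = np.empty(K, dtype=int)
expected[perm_a] = perm_b
print(bool((binding == expected).all()))  # → True
```

In the full model the networks are trained jointly, so the binding and the network weights are refined in alternation rather than computed once from clean labels as here.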