The work investigates the problem of grounding---adding to common ground---in situated human-robot dialogues. Common ground, a special kind of mutual understanding among dialogue parties, is essential for any joint activity, and its establishment is thus central to any interaction. When building a robot that is able to interact with a human, we need to tackle the same problem: not only does the robot need to know what the human expects from it, it also needs to make sure that the human knows what the robot thinks. In other words, the robot and the human need to work together to extend their common ground. We approach this problem by explicitly modelling the beliefs of the agents engaged in the dialogue as a collection of formulas in a formal language. The formulas are given semantics via a translation into a combined modal logic that accounts for their epistemic status, spatio-temporal framing and content. Three classes of epistemic status are modelled: private, attributed and shared. Private beliefs are held by the agent alone, attributed beliefs are beliefs about other agents' private beliefs, and shared beliefs represent common ground. The process of grounding is then formulated as the formation of shared beliefs from private and attributed ones. This process is based on Thomason et al.'s framework for dialogue management, which treats dialogue as one facet of a wider collaborative activity and uses abductive reasoning to infer explanations of observed events. In its original formulation, Thomason et al. require what they call the Principle of Coordination Maintenance, which states that "what is said by the speaker is what is understood by the hearer". This assumption is clearly too strong for direct use. We remove it by introducing the notion of assertion from multi-agent planning. 
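The three epistemic statuses and the grounding step can be illustrated with a minimal sketch. The class names, the string-based stand-in for formulas of the formal language, and the simple matching rule in `ground` are all illustrative assumptions; the thesis itself gives beliefs semantics in a combined modal logic and derives grounding by abductive reasoning rather than by the literal comparison shown here.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class Status(Enum):
    PRIVATE = auto()     # held by the agent itself
    ATTRIBUTED = auto()  # the agent's model of another agent's private belief
    SHARED = auto()      # part of common ground

@dataclass(frozen=True)
class Belief:
    agent: str            # the agent holding the belief
    content: str          # placeholder for a formula in the formal language
    status: Status
    about: Optional[str] = None  # for ATTRIBUTED: whose belief is modelled

def ground(private: Belief, attributed: Belief) -> Optional[Belief]:
    """Form a shared belief when an agent's private belief matches the
    belief it attributes to its interlocutor. This is a simplified
    stand-in for the grounding process described in the text."""
    if (private.status is Status.PRIVATE
            and attributed.status is Status.ATTRIBUTED
            and private.content == attributed.content):
        return Belief(private.agent, private.content, Status.SHARED)
    return None

# The robot privately believes box1 is red and attributes the same
# belief to the human; grounding promotes it to common ground.
robot_priv = Belief("robot", "colour(box1, red)", Status.PRIVATE)
robot_attr = Belief("robot", "colour(box1, red)", Status.ATTRIBUTED, about="human")
shared = ground(robot_priv, robot_attr)
```

The point of the sketch is only the three-way distinction and the direction of the process: shared beliefs are never asserted directly but are formed out of private and attributed ones.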
An assertion is a statement that must be verified against the world state, but until this verification (or falsification) is supplied, its validity can be assumed. If a communication protocol supplies the means for verification or falsification, we can show that the protocol is able to achieve common ground. Our extension of Thomason et al.'s framework, which we call the Continual Collaborative Activity, defines such a protocol. Finally, we present an implementation of the system within a robot's cognitive architecture, in a scenario where the robot learns a correct model of a visual scene in collaboration with a human tutor.
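The lifecycle of an assertion described above can be sketched as follows. The `Assertion` class, the `Outcome` states and the set-membership check are hypothetical simplifications introduced here for illustration; the Continual Collaborative Activity protocol itself is more involved, but the key property is the same: an assertion's validity is assumed until the world state verifies or falsifies it.

```python
from enum import Enum, auto

class Outcome(Enum):
    ASSUMED = auto()     # no verification supplied yet; validity is assumed
    VERIFIED = auto()    # confirmed against the world state
    FALSIFIED = auto()   # contradicted by the world state

class Assertion:
    """A statement whose validity is assumed until checked (illustrative)."""
    def __init__(self, content: str):
        self.content = content
        self.outcome = Outcome.ASSUMED

    def check(self, world_state: set) -> Outcome:
        # The protocol supplies the means of verification; here the
        # world state is modelled as a plain set of true formulas.
        self.outcome = (Outcome.VERIFIED if self.content in world_state
                        else Outcome.FALSIFIED)
        return self.outcome

a = Assertion("colour(box1, red)")
assumed_first = a.outcome          # ASSUMED before any check
world = {"colour(box1, red)", "shape(box1, cube)"}
result = a.check(world)            # verification supplied
```

Under this reading, showing that a protocol achieves common ground amounts to showing that every assertion it introduces is eventually driven out of the assumed state, one way or the other.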