In recent years, cognitive science has seen growing interest in the connection between vision and natural language. The question of interest is: How can we talk about what we see? With this question in mind, we examine the domain of incremental route descriptions, in which a speaker presents the relevant route information step by step while moving through a 3D environment. The speaker must adapt his or her descriptions to the currently visible objects. Two major questions arise in this context: (1) How is visually obtained information used in natural language generation? (2) How are these two modalities coordinated? We present a computational framework for the interaction of vision and natural language description that integrates several processes and representations. In particular, we discuss the interaction between the spatial representation and the presentation representation used for generating natural language descriptions. We have implemented a prototypical version of the proposed model, called MOSES.