A software agent that gives incremental, i.e. step-by-step, route descriptions while moving through an environment is an interesting starting point for an integrated view of visual perception and natural language generation. We present a computational model called MOSES. In particular, we show how visual data are transformed into visuo-spatial representations. An object selection process based on visual features starts from a high-level description of objects in a synthetic three-dimensional environment. Our experiments show that incremental route descriptions can be classified by a small set of syntactic and semantic structures. Taking into account temporal constraints, visuo-spatial structures, path-related intentions, and the rhetorical abilities of the speaker, a selection process extracts description schemata as input for the language generation process. These schemata are modeled by a modified subset of Jackendoff's conceptual semantics formalism.
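
As a rough illustration (this particular structure is not taken from the paper), a description schema in Jackendoff-style conceptual semantics might encode an instruction such as "go into the building" as a nested Event/Path/Place structure:

```
[Event GO ([Thing AGENT],
           [Path TO ([Place IN ([Thing BUILDING])])])]
```

In this notation, each bracketed constituent carries an ontological category (Event, Path, Place, Thing) and a conceptual function (GO, TO, IN), which is the kind of representation a modified subset of the formalism could restrict to route-relevant categories.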