We describe our attempt to integrate multiple AI components, such as planning, knowledge representation, natural language generation, and graphics generation, into a functioning prototype called WIP that plans and coordinates multimodal presentations in which all material is generated by the system. WIP can generate alternative presentations of the same content, taking into account contextual factors such as the user's degree of expertise and preferences for a particular output medium or mode. The current prototype of WIP generates multimodal explanations and instructions for assembling, using, maintaining, or repairing physical devices. This paper introduces the task, the functionality, and the architecture of the WIP system. We show that in WIP the design of a multimodal document is viewed as a non-monotonic process that includes various revisions of preliminary results, massive replanning and plan repairs, and many negotiations between design and realization components in order to achieve an optimal division of work between text and graphics. We describe how the plan-based approach to presentation design can be exploited so that graphics generation influences the production of text and vice versa. Finally, we discuss the generation of cross-modal expressions that establish referential relationships between text and graphics elements.