Wolfgang Broll, Eckhard Meier, Thomas Schardt
GMD - German National Research Center for Information Technology
Institute for Applied Information Technology (FIT)
In existing shared virtual environments the representation of participants is often limited to static avatars at the user's viewpoint. Some systems additionally allow users to perform a set of predefined actions through their avatar. This approach, however, does not seem suitable for representing non-verbal communication. Additionally, shared virtual worlds often fail to attract new users if they are not already heavily populated.
In this paper we will show how we try to overcome these problems by enhancing the user's representation through tangible interfaces and by populating shared virtual worlds with user representatives, even if the user is currently not participating in the virtual world.
Several shared virtual environments based on the Internet have been established over the last years. Most of them, however, use a very simple and rather static representation of their participants by avatars. These characters are simply placed at the current viewpoint of the particular user. In such a world the user's role is basically reduced to that of an observer. People visiting these environments do not really become involved in these worlds, because their acting capabilities are often limited to walking around, following Internet links, and talking to other people. Sometimes simple activities are available, but they are often limited to a small, restricted area of the world.
The lack of interaction capabilities makes it impossible for the participants of shared virtual environments to become active and constructive members of these worlds. Therefore the world's main structure as well as the appearance of the avatars is static and fixed over time. To counter the problem of inexpressive representations of current visitors, some virtual environments try to add additional signs of life to their avatars. In AlphaWorld and in Sony's Community Place, for instance, text typed in the chat window is displayed above the corresponding avatar. Other environments such as OnLive Traveller provide basic speech-to-lips synchronization and a number of predefined expressions or body gestures. Most participants, however, do not use such predefined actions, since they have to be activated explicitly (e.g. by pressing an appropriate button) rather than being captured from the real body motions and facial expressions of the user.
Another phenomenon we can observe is that shared virtual worlds often grow lonely. Similar to the real world, users tend to group together to communicate. But why should people meet in virtual worlds if there is no interesting theme to talk about? Most Internet-based environments do not serve a particular purpose. Users cannot profit from the virtual environment's structure, because no real task or goal is embedded into these scenes. People are forced to hang around without any chance of action, just waiting for other users to join the world. This may explain why such environments are often sparsely populated and finally turn into virtual deserts.
This does not necessarily have to be true, as demonstrated by shared gaming environments such as Ultima Online. In this type of role-playing game there is a global task to perform (most often, to fight the evil). This goal is embedded into a complex, consistent scenery in which players are able to control their avatars. The main aspect of these games is to reach certain goals, develop strategies, extend capabilities, and so on. There is always something to do or to discover in these worlds. We can observe that participants of these games in general become an active part of the game's story by controlling their avatars in a dedicated way. This may explain why users spend several hours in a shared gaming environment, but often leave other worlds after a couple of minutes.
Similar to gaming environments, our approach takes advantage of this kind of persistent, task-oriented virtual world. Within a global, context-dependent scenario, we are able to enhance static user representations by active avatar components. But instead of concentrating on the virtual world's internal state, it is our goal to enrich these environments with real-world user activities. These activities are continuously mapped onto corresponding virtual characters representing the user in an appropriate, context-sensitive way. Furthermore, the combination of characters and real-world events allows us to control avatars in the virtual environment and keep them alive, even if the actual persons currently do not actively participate in the world. Thus users acting in their usual working environment (e.g. editing a document, reading a book, walking to another person's office, etc.) can be represented appropriately by their virtual counterparts. To realize this approach, simple sensor input as well as a number of system events have to be monitored and mapped to avatar or object actions in the virtual world.
In the second section of this paper we will
describe the infrastructure used to map external events and sensor
input to object or avatar specific symbolic actions. In the third
section we will present two sample scenarios.
In this section we will give a short introduction to the basic infrastructure used to realize symbolic actions in shared virtual environments. Our sample scenarios are based on SmallTool - a toolkit for the development of shared multi-user VR applications. We will further present our approach to control the behavior of avatars and characters within a virtual environment. The support provided by our approach can be subdivided into two areas:
The SmallTool multi-user VR toolkit is based on a set of libraries that minimize the effort necessary to create distributed virtual environments populated by users and characters. The main parts of SmallTool include:
The extended VRML library enables us to parse and render 3D objects based on the ISO standard VRML'97. It provides additional support for the representation of users by avatars. In addition to these features, the EV library supports the synchronization of shared scenes.
The distributed worlds transfer and communication protocol DWTP provides a high level application network interface adapted to the special needs of shared virtual environments on the Internet. In addition to its client interface, it provides a number of services which can be used as daemons or within application servers to realize scalability, reliability and persistence.
The device independent communication interface finally supports the easy connection of new innovative I/O devices via the Internet. Each I/O device is connected to a DICI server part, which makes the device available to all or selected hosts on the Internet. Applications which want to use these services (either receiving input data or sending output data) simply include the DICI client interface. The services can then be used by specifying the Internet address and the name of the requested service. We have realized DICI servers for 6DOF magnetic trackers and the MOVY tracking system . MOVY is a wireless inertial tracker developed in our institute. Compared to magnetic tracking devices it has a wider operation area and is not influenced by metal or electric fields.
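The DICI pattern of publishing a device as a named service that clients subscribe to can be sketched roughly as follows. This is our own in-process illustration, assuming hypothetical class and method names; the actual DICI interface exports services over the Internet by host address and service name.

```python
import queue

class DiciServer:
    """Publishes input samples of one I/O device under a service name.

    Hypothetical sketch: the real DICI server part makes the device
    available to hosts on the Internet; here subscribers are simple
    in-process queues.
    """
    def __init__(self, service_name):
        self.service_name = service_name
        self._subscribers = []

    def subscribe(self, inbox):
        self._subscribers.append(inbox)

    def publish(self, sample):
        # deliver the device sample to every subscribed client
        for inbox in self._subscribers:
            inbox.put(sample)

class DiciClient:
    """Receives input data from a named service (host lookup omitted)."""
    def __init__(self, server):
        self.inbox = queue.Queue()
        server.subscribe(self.inbox)

    def read(self):
        return self.inbox.get_nowait()

# Example: a MOVY-like tracker service delivering a 6DOF sample.
tracker = DiciServer("movy-tracker")
client = DiciClient(tracker)
tracker.publish({"pos": (0.0, 1.2, 0.5), "rot": (0.0, 0.0, 0.0, 1.0)})
```

The same subscription shape works for output devices in the opposite direction: the application publishes, the device-side server consumes.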
The SmallTool libraries are currently available for Windows95/98/NT, Linux and some UNIX flavors (IRIX, SOLARIS).
We have built some sample applications on top of the SmallTool libraries. Our main application is SmallView, a multi-user VRML browser. For rapid prototyping of new 3D applications SmallView provides an application scripting interface.
The main purpose of symbolic actions is to map external events on changes within a virtual world. Events based on user activities will usually be mapped to actions of the user's avatar or character. They may however also be used to map arbitrary events to changes of objects in a synthetic virtual environment.
We distinguish between the avatar and the character of a user: the avatar visualizes the viewpoint of a remote user to all other users, giving the participant a virtual representation within a shared multi-user environment. In contrast to avatars, a character can be used to represent particular activities of a user even when the user is not participating in the virtual environment. Thus the representation by a character can even be used in single-user virtual worlds.
We have realized a prototype implementation of a symbolic action module, which allows us to map external events onto user- or scene-specific actions. These actions may include e.g. animations, object changes, or sound. The symbolic actions can be configured by defining a simple mapping between the received external event data and the event recipient within the 3D scene. Additionally, the internal events issued can depend on the previous external event (e.g. for stopping the last action). An external event may even be mapped to several internal events. Internal events can be issued concurrently or in sequence. An example of a concurrent action is a character walking from one room into another: the walking animation, which moves the arms and legs of the character, has to be performed concurrently with the movement of the whole character body to the new location. Other actions, such as sitting down and reading a book, require two or more events to be issued in sequence. Often a certain gap between two actions is required, or an action needs a minimum time to be performed (e.g. when changing locations). Timeout values can be used to specify the duration of actions.
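The mapping table and the concurrent-versus-sequential scheduling can be sketched as follows. The event names, recipients, and durations are illustrative assumptions, not the module's actual configuration syntax.

```python
# Hypothetical mapping table: one external event maps to one or more
# internal scene events, issued concurrently or in sequence, with an
# optional minimum duration (timeout) per action.
MAPPING = {
    # external event: (mode, [(recipient, action, min_duration_s), ...])
    "user_walks": ("concurrent", [("avatar", "walk_animation", 0.0),
                                  ("avatar", "move_to_target", 2.0)]),
    "reads_book": ("sequence",   [("avatar", "sit_down", 1.0),
                                  ("avatar", "open_book", 0.5)]),
}

def map_external_event(name):
    """Return (start_time, recipient, action) tuples for one external event."""
    mode, actions = MAPPING[name]
    if mode == "concurrent":
        # concurrent actions (e.g. walk animation + body movement) all
        # start immediately
        return [(0.0, recipient, action) for (recipient, action, _) in actions]
    # sequential actions start after the previous action's minimum duration
    schedule, t = [], 0.0
    for recipient, action, duration in actions:
        schedule.append((t, recipient, action))
        t += duration
    return schedule
```

Stopping the previous action on a new external event would be one more lookup against the last mapped event before issuing the new schedule.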
In our prototype the symbolic action module is currently limited to mapping external events to VRML events. It is not yet possible to query the state of objects in the VRML scene or to find e.g. the nearest object of a particular type. The current version is completely implemented as a SmallView application script. Future releases will be based on an additional library integrated into the browser.
Our SmallView browser provides the capability to load and run several external scripting applications. We currently use an extended C++ version (instead of Java) of the standard VRML EAI to transfer the events issued by the symbolic action module into the VRML scene graph. Animation sequences for particular object types (defined by VRML prototypes) are started by sending appropriate time events to the corresponding VRML time sensor nodes. An object which wants to provide an external behavior interface has to define an event input for each action. This mechanism allows us to define behaviors which are independent of the spatial context, such as walking, shaking the head, or jumping. Going beyond the VRML standard, our implementation supports the inheritance of prototypes. By overloading the interface of prototypes in sub-classes, this can be used to simulate dynamic binding (similar to the concept used in VRML++). Additionally, this concept provides us with the mechanism required to realize context-sensitive behavior such as walking into the dining room or sitting on a chair. To identify the type of objects, however, they have to be inherited from the appropriate base object (e.g. a dining room from a room).
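The behavior interface and the overloading of sub-prototypes can be modeled with a small Python analogue. This is a sketch of the idea only; the actual implementation uses VRML prototypes, time sensor nodes, and the C++ EAI, and the names below are our own.

```python
class AvatarProto:
    """Models a VRML prototype with one eventIn per symbolic action.

    Each action name resolves to the (hypothetical) TimeSensor node that
    drives the corresponding animation; sending an event sets its
    startTime, which starts the animation.
    """
    # eventIn name -> TimeSensor node behind it
    actions = {"walk": "walkTimer", "jump": "jumpTimer"}

    def __init__(self):
        self.started = []  # record of (sensor, startTime) events sent

    def send_event(self, action, now):
        sensor = self.actions[action]       # route eventIn to its TimeSensor
        self.started.append((sensor, now))  # startTime := now

class DiningRoomAvatar(AvatarProto):
    """Sub-prototype: overloading the interface adds context-sensitive
    behavior, analogous to simulated dynamic binding via inheritance."""
    actions = dict(AvatarProto.actions, sit="sitOnChairTimer")
```

A caller holding a `DiningRoomAvatar` through the base interface can still trigger `walk`, while `sit` is only available where the context (the sub-prototype) provides it.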
In addition to this mechanism, our realization allows us to add arbitrary input devices to VRML scenes. Beyond the mapping provided by the symbolic action module, named external events can be grabbed by objects within the 3D scene. Once caught by a scene object, the event data can be forwarded to other parts of the scene by the standard routing mechanism of VRML. This makes it possible, for example, to connect a 6DOF tracking system to the limbs of an avatar.
More complex representations of body motion
would require kinematics currently not possible within standard VRML.
Simple extensions to the specification of object transformations within
the scene could be used to specify joints. By limiting the direction
and amount of translations, or the axis and angle of rotations, inverse
kinematics can be used to calculate the intermediate transformations.
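The effect of such joint limits can be illustrated with a small numerical sketch (our own example, not part of the implementation): the bend angle of a two-link arm is solved analytically and then clamped to the hinge's allowed range, exactly the kind of constraint the proposed transformation extensions would express.

```python
import math

def clamp(value, lo, hi):
    """Restrict a joint value to its allowed range (the proposed limit)."""
    return max(lo, min(hi, value))

def elbow_bend(distance, l1, l2, lo=0.0, hi=math.radians(150)):
    """Bend angle of a two-link planar arm reaching a point at `distance`.

    The law of cosines gives the interior elbow angle; the bend (the
    deviation from a straight arm) is clamped to the hinge limits [lo, hi].
    Link lengths l1, l2 and the 150-degree limit are illustrative values.
    """
    c = (l1 * l1 + l2 * l2 - distance * distance) / (2 * l1 * l2)
    c = max(-1.0, min(1.0, c))   # guard against rounding / unreachable targets
    interior = math.acos(c)
    return clamp(math.pi - interior, lo, hi)
```

A fully extended reach yields a bend of zero, while a target too close to the shoulder would demand a bend beyond the limit and is clamped instead.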
In this section of the paper we will present two
sample scenarios which have been realized within SmallView by the
mechanisms presented in the second section. The first scenario shows
how mutual awareness in virtual teams or between remote users (e.g.
tele-workers) can be increased by symbolic character representations.
The second part presents our approach to enhance the communication
between distributed users represented by avatars by providing
additional feedback mechanisms.
In many companies people working together are distributed over several rooms (or even floors or buildings). Some colleagues may even work from home. Since most of the work has to be done individually, colleagues get together only from time to time to synchronize their work, decide what to do next, exchange information, or simply talk to each other.
Back in their offices these coworkers have only the information from the last meeting, although a lot of things could have happened since then: a shared document could have been finished or modified; a colleague could have added some information to a shared workspace; or some people may decide to have a spontaneous meeting in the coffee room. Informing every member of a working group of such events produces a communication overhead: having modified or replaced a shared document, one has to send an email to all interested people. Spontaneous discussions are interrupted, because missing colleagues have to be invited explicitly.
The goal of our system is to capture such events and inform other people directly. Existing approaches to this problem send an email or pop up a window on the user's desktop. This, however, seems too intrusive to achieve a peripheral awareness of the environment. In our opinion it is important to provide additional peripheral awareness similar to BT's Contact Space or Form Meeting Space. Our approach therefore is based on a comprehensive virtual workgroup scenario including the representation of users by active avatars, appropriate visualizations of the users' real working environment (e.g. offices, coffee room), as well as virtual workspaces. These virtual workspaces may be used to represent teams or subject-related data. Ideally the 3D representation should be displayed on a separate screen rather than in an additional window on the user's desktop. This allows users to concentrate on their work while staying aware of the overall situation in their workgroup.
Within these workgroup scenarios each member is represented by an animated avatar. The avatar's behavior represents the activity of its owner symbolically: if an employee opens a shared workspace and fetches a document, his avatar moves to the room representing this workspace, takes a paper out of a bookshelf, moves back into his office, sits down, and modifies this document. If two users are talking in the coffee room, their avatars also move to the corresponding virtual counterpart.
In our prototype we use this system to visualize the activity of users in BSCW workspaces. Currently the system recognizes only a small set of user activities and maps them to symbolic actions of the appropriate avatar (see figure 1):
To recognize the user activities, a set of hardware and software sensors is used. Browsing through different workspaces as well as the up- and downloading of documents are software events which can be captured within the BSCW. Editing a downloaded document can be recognized via appropriate software sensors (see section 3.2). To recognize whether the user is still working in his office, moving along the floor, or standing in the coffee room having a talk, additional sensors including the MOVY inertial tracker, webcams, and light sensors are used. The sensor data is either used to create events for the NESSIE awareness environment, which can then be received by the SmallView browser, or the data is received directly from a DICI server connected to the sensor.
The 3D-representation of the daily work allows
remote users to achieve almost the same peripheral awareness as if
working within the same office. Additionally this representation can be
used to populate distributed virtual worlds. Users can be present and
accessible in the 3D world, even while doing their regular work.
Whereas the first scenario focused on peripheral awareness and the population of shared virtual worlds, the second scenario enhances the possibilities of communication in virtual worlds between distributed users.
One problem of most existing shared virtual environments representing users by avatars is that the avatars do not behave naturally. The reason is not the individual representation (which can be arbitrary) but the lack of information about the user's facial expressions and gestures, which is essential for most types of communication and especially for cooperation. In many situations, such as discussions or presentations, the non-verbal feedback of the listeners is very important. Consider a teacher who tries to give an explanation to the class: until the teacher has finished the explanation, the class usually listens. During that period the teacher does not get any verbal feedback from the class, but he gets a continuous stream of information via non-verbal feedback. The class may be interested or lackadaisical, may agree or disagree, or may wonder about the explanation. The feedback of the class helps the teacher to adjust his explanation. For the class, the facial expressions and gestures of the teacher are important, because they express or reinforce important points.
Some virtual environments allow users to express their mood through their avatar. However, this usually requires the user to explicitly activate an appropriate expression of his avatar (e.g. by pressing a button). This solution has two major drawbacks: first, people do not use it very often, since it is not intuitive. Second, the duration of the avatar's expression (either predefined or until changed again) usually does not match the duration of the real mood.
To represent body language, gestures, facial expressions, and mood in the representation of users by avatars, we combine sensors and tangible interfaces with symbolic avatar actions. The main criterion for the selection of the sensors is to get a maximum of information about the user without disturbing or distracting the user while he or she is working or moving around. To reach this goal, both hardware and software sensors can be used.
A hardware sensor used to detect user actions is the MOVY tracking system described earlier. Different types of user actions can be tracked by MOVY depending on the location of the sensor.
Figure 2 shows two users in a shared virtual world represented by penguin avatars. The arm movements of the users are tracked by MOVY sensors and used to animate the wings of the penguins. The avatar gestures are transmitted over the network to enhance the expressiveness and feedback during the communication of remote avatars.
Another hardware sensor is a camera used to recognize facial expressions. This requires that users provide a set of typical facial expressions for each mood to be visualized by the avatar before participating in a distributed meeting. Basic facial expressions are neutral, laughing, wondering, and angry. After a user has entered a shared virtual world, his face is continuously captured by a monitor-mounted camera. The captured image is then compared to the pre-recorded facial expressions. If one of these expressions is recognized, the appropriate action of the user's avatar (usually a short animation) can be invoked (see figure 3). By default the predefined neutral expression is displayed. In addition to the recognition of the user's facial expression or mood, the absence of the user can easily be detected. This simply requires an additional reference image showing the empty office environment as captured by the camera.
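The comparison against pre-recorded templates can be sketched as nearest-template matching. This is a simplification under our own assumptions: images are reduced to flat tuples of grey values, the distance metric and threshold are illustrative, and the paper does not specify the actual matching algorithm.

```python
def l1_distance(a, b):
    """Sum of absolute pixel differences between two grey-value images."""
    return sum(abs(x - y) for x, y in zip(a, b))

def classify_expression(frame, templates, threshold=50):
    """Pick the pre-recorded expression closest to the captured frame.

    templates maps expression names ('neutral', 'laughing', ...) to
    reference images; an 'empty' template of the unoccupied office would
    detect the user's absence the same way.
    """
    best_name, best_dist = None, float("inf")
    for name, template in templates.items():
        dist = l1_distance(frame, template)
        if dist < best_dist:
            best_name, best_dist = name, dist
    # fall back to the default neutral expression if nothing matches well
    return best_name if best_dist <= threshold else "neutral"
```

The recognized name would then select the avatar's short animation via the symbolic action module.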
Software sensors can be used to detect the activity of the keyboard and mouse as well as the currently activated application window, which gives us information about the attention and focus of the user. A background process called the SystemSpy continuously captures all input events (keyboard, mouse, activating a window, etc.). Thereby it detects which applications are currently used and whether the user is active or idle. An idle user can be represented by a sleeping avatar, whereas an inattentive user (who is working in another application but still present) may be represented by an avatar who is looking around.
All incoming data of the available hardware and software sensors is processed on the local host on which the sensor is located. The pre-processed data is made available to the VR application by the DICI client-server architecture. In order to create a natural behavior of the represented user, the captured data has to be weighted depending on the individual input device. Inputs representing activity (e.g. the use of the keyboard or mouse, or of the VR application) are weighted higher than events representing inactivity (switching between several applications, a bored facial expression), and fast sensors are weighted higher than slow sensors.
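The weighting step can be sketched as a simple signed vote, where each reading contributes evidence for activity or inactivity. The sensor names and weight values below are our own illustrative assumptions, not the paper's actual parameters.

```python
# Illustrative weights: activity evidence and fast sensors (keyboard,
# mouse) count more than inactivity evidence and slow sensors (camera).
WEIGHTS = {
    ("keyboard", "active"): 3.0,
    ("mouse", "active"): 3.0,
    ("window_switch", "idle"): 1.0,
    ("camera_bored", "idle"): 1.5,
}

def fuse(readings):
    """readings: list of (sensor, state) pairs; returns the avatar's state.

    Activity evidence adds its weight, inactivity evidence subtracts it,
    so a single keyboard event outweighs a slow bored-face reading.
    """
    score = 0.0
    for sensor, state in readings:
        weight = WEIGHTS.get((sensor, state), 1.0)
        score += weight if state == "active" else -weight
    return "active" if score > 0 else "idle"
```

The resulting state can then be mapped to the symbolic avatar actions described above (sleeping, looking around, working).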
The recognition of user activities by a set of sensors seems to require a rather high computational and personal effort compared to the use of video streams for the transmission of facial expressions, gestures, and body motions. This method, however, does not require a high bandwidth and thus can be used even over modem connections. Additionally, this approach allows us to provide a certain amount of privacy to the user, since no video images of the user or the office need to be transmitted.
In this paper we presented our approach to enhancing the avatar representation of users in shared virtual environments. We additionally showed how symbolic representations of user actions can be used to populate virtual worlds, providing an interesting and useful scenario for members of virtual teams and remote co-workers.
In our future work we will use additional
pre-processed third party motion data to provide a comprehensive
library of body language and object dynamics descriptions. This
approach will be based on new interfaces to existing body motion
libraries. Additionally we will further enhance the external interface
to VRML and its capabilities to handle kinematics.