
A Preliminary Evaluation

This section reports the results of a preliminary evaluation aimed at testing the relative utility of the Robust Parser (RP) and the CLE. To this end, we used two configurations of the system. One of them (RP--CLE) corresponds to the architecture shown in Figure 1, in which the CLE and the Robust Parser work in parallel. In the other (RP-only), the CLE was disabled, leaving only the shallow processing path.

Two similar tasks, A and B, were created, each involving a trip with at least three legs during two consecutive days, suitable for both train and air travel. Two subjects took part. Each of them was first given the opportunity to try out the RP--CLE version of the system. More specifically, they used the demo version of the system, in which system components are highlighted as they engage in processing, and in which the recognized utterance and the system's responses are successively written into a window. The purpose of this was to give the subjects a better sense of what was going on, since the system could otherwise remain silent for typically 30--60 seconds during Internet database queries. When the subjects felt that they were able to handle the system, they were presented with tasks A and B in different orders.

The experiment resulted in four dialogues, each consisting of between 22 and 28 user--system turns. Each turn was tagged with ``OK'' or ``failure'', depending on whether the system had managed to move the dialogue forward or not in response to the user's utterance (provided that the utterance was reasonable given the context). ``Failure'' thus consists of cases where the system responded that it did not understand the last utterance or where its response constituted a misunderstanding. Furthermore, each turn was tagged with ``user'' or ``system'', depending on whether the subject's utterance was a response to a system initiative or whether the utterance constituted a user initiative (for example, a spontaneous request for information or a counter-question). The tasks were designed so as to encourage mixed initiative, and both subjects displayed a majority of user initiatives in their dialogues.
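The tagging scheme described above can be sketched as a small data structure. This is only an illustration: the class name `Turn`, the function `tally` and the example dialogue are assumptions made for this sketch, not part of the system or of the experimental data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    """One user--system turn with the two tags used in the evaluation."""
    outcome: str     # "OK" or "failure"
    initiative: str  # "user" or "system"


def tally(turns):
    """Count turns per (initiative, outcome) pair."""
    return Counter((t.initiative, t.outcome) for t in turns)


# Invented example dialogue, NOT data from the experiment.
example = [
    Turn("OK", "system"),
    Turn("failure", "user"),
    Turn("OK", "user"),
    Turn("OK", "user"),
]
counts = tally(example)
```

A tally of this kind is enough to derive per-dialogue figures such as the number of ``OK'' turns or the share of user initiatives.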

Because of the small size of the experiment, the results at this point can only be taken as suggestive. Nevertheless, to provide a rough idea of where we stand in our on-going work, we shall briefly present some figures that we obtained.

To begin with, the RP--CLE configuration appeared slightly more efficient at moving the dialogue forward than the RP-only one: the RP--CLE and RP-only dialogues used on average 22 and 27 moves, of which 15 and 14 were ``OK'' turns, respectively. However, in terms of providing successful analyses (in the cases when at least one fragment of the output from the speech recognizer was reasonable), the RP was slightly more successful than the CLE within the RP--CLE configuration: it succeeded on average on 16 turns, whereas the CLE succeeded on 13. Surprisingly, the RP also turned out to be a bit more successful on those turns where the user had taken the initiative: it was successful in almost 2/3 of those cases, whereas the CLE was successful in about half of them.
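To make the efficiency comparison concrete, the share of ``OK'' turns can be computed from the averages reported above. The figures are the rounded averages from the text, so the resulting shares are only approximate; the variable names are chosen for this sketch.

```python
# (average "OK" turns, average moves) per configuration, as reported in the text.
reported = {"RP--CLE": (15, 22), "RP-only": (14, 27)}

# Share of turns that moved the dialogue forward.
ok_share = {name: ok / moves for name, (ok, moves) in reported.items()}

for name, share in ok_share.items():
    print(f"{name}: {share:.2f}")
# prints:
# RP--CLE: 0.68
# RP-only: 0.52
```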

A closer analysis revealed that five times in each of the RP--CLE dialogues, the CLE failed to deliver a correct analysis because it had chosen the wrong fragment (usually one that was too long). The reason for this is that the CLE attempts to analyse the longest grammatical fragment on the path chosen from the N-best list, something which may lead to strange results (compare the example further below).

In terms of which component caused the most turn failures, the picture was unclear. In the RP--CLE case, only a single ``failure'' turn in each dialogue was actually due to language analysis (in which case both the RP and the CLE failed, though the CLE had the better analyses). In the RP-only case, the RP caused none of the 11 failures in one of the dialogues, whereas in the other, it caused 5 of 15 failures.

The figures also indicate that language analysis was not the main bottleneck of the system (both speech recognition and dialogue management were the sources of more failed turns). This might have played a role in the fact that neither of the subjects said that they had noted any difference in overall performance between the RP--CLE and RP-only configurations of the system. But the relatively small difference in overall turn efficiency, as indicated above, might also have contributed to this.

Our analysis also indicates that the Dialogue Manager is quite good at choosing between analyses from the RP and CLE: in the two RP--CLE dialogues, there was only a single case of the Dialogue Manager choosing the wrong alternative. (In this case, it chose a CLE analysis that lacked some information but was otherwise correct, and so it still managed to move the dialogue forward.)

We now turn to some qualitative differences between the RP and CLE that we have observed in our analysis above. To begin with, the obvious advantage of the Robust Parser (RP) is that it is rather undisturbed by ungrammaticalities, disfluencies and (to some extent) recognition errors in the input. For example, the utterance

Hej jag beställer en flygbiljett den åttonde i sjätte tisdag från Stockholm till Sundsvall. (Hi I'm ordering a flight ticket on June eighth from Stockholm to Sundsvall.)
recognized as VAD HEJ JAG BESTÄLLER JAG VILL JAG DEN ÅTTONDE I SJÄTTE I JAG MMM DÅ STOCKHOLM TILL SUNDSVALL. (roughly What hi I'm ordering I want I on June eighth in I mmm then Stockholm to Sundsvall.)

is analysed perfectly by the RP. The CLE locates the longest grammatical fragment ``den åttonde i sjätte'', and produces an analysis that includes the date but not the destination and origin cities of the trip.

As pointed out above, the strategy of choosing the longest grammatical fragment can sometimes lead the CLE completely astray. The utterance

Jag bokar det tåget. (I book that train.)

was misrecognized as

JAG BOKAR DET DET TÅGET

whose longest grammatical fragment is ``bokar det det tåget'' (``does that book that train''), which is something completely different from what the user actually said. The CLE failed to produce any FUD, while the RP got it right.
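The longest-fragment strategy can be illustrated with a minimal sketch. It is not the CLE's implementation: the function name is hypothetical, the grammar is replaced by a toy lookup set of word sequences assumed to be grammatical, and only a single recognizer hypothesis is considered rather than a path through the N-best list.

```python
def longest_grammatical_fragment(words, is_grammatical):
    """Return the longest contiguous span accepted by is_grammatical.

    Spans are tried from longest to shortest, left to right,
    so the first accepted span is a longest one.
    """
    n = len(words)
    for length in range(n, 0, -1):
        for start in range(n - length + 1):
            span = words[start:start + length]
            if is_grammatical(span):
                return span
    return []


# Toy stand-in for a grammar (an assumption for this sketch only).
TOY_GRAMMATICAL = {
    ("jag", "bokar", "det", "tåget"),
    ("bokar", "det", "det", "tåget"),
    ("det", "tåget"),
}


def toy_is_grammatical(span):
    return tuple(span) in TOY_GRAMMATICAL


recognized = "jag bokar det det tåget".split()
fragment = longest_grammatical_fragment(recognized, toy_is_grammatical)
# fragment == ['bokar', 'det', 'det', 'tåget']
```

Because length is the only selection criterion, the intended reading ``jag bokar det tåget'' is never considered: it is not a contiguous span of the misrecognized string.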

On the other hand, the RP can produce erroneous results because it analyses unconnected bits and pieces of sentences. For instance, the RP analysed ``Klockan nitton eller senare'' (``at seven pm or later'') as ``at seven pm, and later than some previously mentioned trip'', because it triggered on the two separate patterns ``klockan nitton'' and ``senare'' without considering the relation between them.
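The effect can be reproduced with a minimal pattern-spotting sketch. The patterns, labels and output format below are assumptions made for illustration and do not reflect the actual RP rules.

```python
import re

# Hypothetical patterns; each one is matched independently of the others.
PATTERNS = [
    ("departure_time", re.compile(r"\bklockan (\w+)\b")),
    ("later_than_previous", re.compile(r"\bsenare\b")),
]


def spot(utterance):
    """Return (label, matched text) for every pattern that fires."""
    text = utterance.lower()
    found = []
    for label, pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            # Use the captured group if the pattern has one, else the whole match.
            value = match.group(1) if match.lastindex else match.group(0)
            found.append((label, value))
    return found


result = spot("Klockan nitton eller senare")
# result == [('departure_time', 'nitton'), ('later_than_previous', 'senare')]
```

Both patterns fire, but nothing records that ``eller'' (``or'') connects them, so the two constraints are reported as unrelated.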

Actually, the very robustness of the RP can sometimes prove to be a disadvantage. In one case, the test subject meant to say ``Jag har företagsrabatt på flyget'' (``I have a corporate discount on air travel''), but the input became totally garbled: ``JA DÅ HAR FÖRETAG FYRA VAD FÖR ATT FLYGA'' (roughly ``Yes then has company four what for to fly''). The CLE did not produce any FUD. The RP reacted to ``to fly'', and its analysis together with the keyword ``Ja'' (``Yes'') in the utterance made the system book a previously mentioned flight alternative. If the RP had been disconnected, the system's reply would instead have been to ask the user to rephrase her utterance, which would certainly have been a more sensible reaction.

A considerable advantage of the CLE is its ability to look at, and possibly combine, the top N hypotheses from the recognizer. On several occasions this proved to be important, for example, in correcting the top hypothesis ``hur och retur'' into ``tur och retur'' (``return trip''). The RP, which only has access to the top hypothesis, could only produce the ``null'' result wh(X, []) in this case.
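The difference between top-hypothesis and N-best access can be sketched as follows. This is a simplified illustration under stated assumptions: the function names and the toy analyser are hypothetical, the second entry in the N-best list is invented (the text only reports the top hypothesis and the corrected phrase), and the sketch selects a whole hypothesis rather than combining parts of several, which the CLE is also able to do.

```python
def first_analysable(hypotheses, analyse):
    """Return (rank, analysis) for the first hypothesis that can be analysed."""
    for rank, hypothesis in enumerate(hypotheses):
        analysis = analyse(hypothesis)
        if analysis is not None:
            return rank, analysis
    return None


# Toy analyser: a lookup table standing in for real language analysis.
KNOWN_PHRASES = {"tur och retur": "return_trip"}


def toy_analyse(hypothesis):
    return KNOWN_PHRASES.get(hypothesis.lower())


# Invented N-best list; only the top entry is taken from the text.
n_best = ["hur och retur", "tur och retur"]

top_only = toy_analyse(n_best[0])                      # None: nothing recognized
with_n_best = first_analysable(n_best, toy_analyse)    # (1, 'return_trip')
```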


Mats Wiren
Mon Oct 25 13:51:54 MET DST 1999