Project | OLIVE

Duration: 05/01/1998 - 04/30/2000

Play it again, Sam! - Retrieval of video material based on speech recognition

Olive is closely related to the Pop-Eye project, which uses subtitles to retrieve video material, and takes that approach further by also taking the spoken word into account. Drawing on automatic speech transcriptions and background material such as press releases and other descriptive texts, Olive covers the full range of linguistic material associated with a video. As in Pop-Eye, the basic approach is to provide detailed access to the content of videos by indexing time-coded written texts. The user can pose textual or bibliographic queries, which lead to index terms and/or videographic data. From there the user can retrieve image material, primarily automatically extracted stills or key frames associated with the text, and finally the digitised video stream itself.
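The path from index term to key frame to video position can be illustrated with a minimal sketch. It is not Olive's implementation: the function names, the tuple layout of the segments, and the rule of picking the latest key frame at or before the hit are all assumptions made for illustration.

```python
from bisect import bisect_right
from collections import defaultdict


def build_timecoded_index(segments):
    """Build a term -> [(video_id, start_seconds)] index.

    `segments` is an iterable of (video_id, start_seconds, text) tuples,
    a hypothetical stand-in for time-coded transcript passages.
    """
    index = defaultdict(list)
    for video_id, start, text in segments:
        for term in set(text.lower().split()):
            index[term].append((video_id, start))
    return index


def nearest_key_frame(key_frame_times, t):
    """Return the latest key frame time <= t (or the first key frame).

    Assumes `key_frame_times` is sorted in ascending order and non-empty.
    """
    i = bisect_right(key_frame_times, t)
    return key_frame_times[max(i - 1, 0)]


def retrieve(term, index, key_frames):
    """Step 1: term -> time-coded hits; step 2: hit -> key frame;
    step 3: the hit's time code is the position to play the video from."""
    results = []
    for video_id, start in index.get(term.lower(), []):
        frame = nearest_key_frame(key_frames[video_id], start)
        results.append(
            {"video": video_id, "key_frame_at": frame, "play_from": start}
        )
    return results
```

Given a segment starting at 12.5 seconds that contains "minister" and key frames at 0, 10 and 20 seconds, a query for "minister" would return the key frame at 10 seconds and a playback position of 12.5 seconds.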

Some of the limitations of speaker-independent, large-scale speech recognition are partly compensated for by taking additional textual material into account. Domain- or document-specific vocabulary reduces the number of hypotheses the recogniser has to consider, which makes recognition more reliable. Where manual transcripts exist (produced, for example, for subtitling), aligning them with the automatic transcriptions allows the time codes of the automatic transcription to be transferred to the higher-quality manual text.
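The time-code transfer can be sketched as a word-level alignment. The example below uses Python's standard difflib.SequenceMatcher as a stand-in aligner; the project's actual alignment method is not described here, and the function name and data layout are hypothetical.

```python
from difflib import SequenceMatcher


def transfer_timecodes(asr_words, manual_words):
    """Copy time codes from ASR output onto a manual transcript.

    asr_words:    list of (word, start_seconds) from the recogniser.
    manual_words: list of words from the manual transcript.
    Returns a list of (manual_word, start_seconds or None); None marks
    words with no exact counterpart in the ASR output.
    """
    asr_tokens = [w.lower() for w, _ in asr_words]
    manual_tokens = [w.lower() for w in manual_words]
    matcher = SequenceMatcher(a=asr_tokens, b=manual_tokens, autojunk=False)

    times = [None] * len(manual_words)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            # Words inside a matching block line up one to one.
            times[block.b + k] = asr_words[block.a + k][1]
    return list(zip(manual_words, times))


# The recogniser split "again" into "a" and "gain", so that word gets no time code.
asr = [("play", 0.0), ("it", 0.4), ("a", 0.6), ("gain", 0.7), ("sam", 1.1)]
manual = ["Play", "it", "again", "Sam"]
aligned = transfer_timecodes(asr, manual)
# aligned == [("Play", 0.0), ("it", 0.4), ("again", None), ("Sam", 1.1)]
```

Words left without a time code could be interpolated from their timed neighbours; that step is omitted to keep the sketch short.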

Olive also takes the cross-lingual retrieval functionality of Pop-Eye further by combining two strategies: document translation, where an index is built from documents that are automatically translated offline, and query translation, where the query is translated online in order to access an index built from the original documents.
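A toy sketch of why the two strategies complement each other follows. Everything in it is an assumption for illustration: the word-by-word lexicon lookup stands in for real machine translation, and taking the union of both result sets is only one possible way of combining them, since the merging strategy is not specified here.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lower-cased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def translate(text, lexicon):
    """Toy word-by-word translation; unknown words pass through unchanged."""
    return " ".join(lexicon.get(w, w) for w in text.lower().split())


def cross_lingual_search(query, originals_index, translated_index, query_lexicon):
    """Return documents found by either route (union of hits)."""
    hits = set()
    # Document translation: the untranslated query is matched against
    # an index built offline from translated documents.
    for term in query.lower().split():
        hits |= translated_index.get(term, set())
    # Query translation: the query is translated online and matched
    # against an index built from the original documents.
    for term in translate(query, query_lexicon).split():
        hits |= originals_index.get(term, set())
    return hits


# Hypothetical German originals and deliberately incomplete lexicons.
originals = {"d1": "der Minister sprach", "d2": "das Wetter morgen"}
de_en = {"der": "the", "sprach": "spoke", "das": "the", "morgen": "tomorrow"}
en_de = {"weather": "wetter"}

originals_index = build_index(originals)
translated_index = build_index({d: translate(t, de_en) for d, t in originals.items()})
```

With these lexicons, the query "weather" is found only through query translation (the offline translation left "wetter" untranslated), while "spoke" is found only through the translated-document index (the query lexicon has no entry for it).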

Video retrieval on the basis of indices constructed from transcribed speech
Three-step retrieval process: from textual index terms, to associated still images, to video segments
Cross-lingual retrieval on the basis of document translation and query translation
Additional background material integrated for improved speech recognition and for further disclosure of video contents