VOICE Awards Corpus

The following tools support the processing steps that follow annotation, such as extracting system-specific lexica and n-grams, computing chi-square values, and constructing templates. The most important tools are listed and briefly described below.

xml_parser.pl

This script parses the NXT files and creates a corpus file in several steps:

collecting all information already available in the *.xml files
merging interrupted SDS turns
inserting an empty user answer string where the user gave no input
setting user and SDS turn id numbers
merging interrupted user answers
extracting slots from alternative_questions and inserting self_correct tags for fillers
?
marking up the content of user and SDS utterances using keywords and rules
constructing templates

Each step produces an output file which serves as input for the next step.
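
The chaining of steps can be illustrated with a minimal sketch. It is written in Python rather than Perl and is not the code of xml_parser.pl; the two step functions, the file naming scheme (corpus.stepN.txt) and the line-based format are assumptions made only for illustration.

```python
from pathlib import Path
from typing import Callable, List, Sequence

# A step maps the lines of one intermediate file to the lines of the next.
Step = Callable[[List[str]], List[str]]


def drop_blank_lines(lines: List[str]) -> List[str]:
    """Hypothetical step: remove empty lines."""
    return [line for line in lines if line.strip()]


def number_turns(lines: List[str]) -> List[str]:
    """Hypothetical step: prefix every turn with a running id."""
    return [f"{i}\t{line}" for i, line in enumerate(lines, start=1)]


def run_pipeline(source: Path, steps: Sequence[Step], workdir: Path) -> Path:
    """Apply the steps in order. Every step writes its own output file,
    which the following step reads as its input."""
    current = source
    for index, step in enumerate(steps, start=1):
        lines = current.read_text(encoding="utf-8").splitlines()
        target = workdir / f"corpus.step{index}.txt"
        target.write_text("\n".join(step(lines)) + "\n", encoding="utf-8")
        current = target
    return current  # path of the last file that was written
```

Keeping one file per step makes it possible to inspect intermediate results or to restart the pipeline from a later step.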

chi_square.prl

This script extracts keywords from the transcriptions.

a) sds2chi-all: all SDS against Tiger
b) sds2chi-domains: chi-square values for each class of both domains

The following files are created (or updated):
corpus-info: length of the corpora
residue-file: names of the respective comparison corpora
chisquare.log: log file
parsefile: file created by the nxt_parser
all .lex and .lex.html files
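
As an illustration of chi-square based keyword extraction, the following Python sketch scores each word of a target corpus against a reference corpus. It is not taken from the script: the exact formula used there is not documented here, so the sketch assumes the common Pearson chi-square on a 2x2 contingency table (word vs. all other words, target vs. reference corpus). The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def chi_square(a: int, b: int, n_target: int, n_ref: int) -> float:
    """Pearson chi-square for a 2x2 table.

    a: frequency of the word in the target corpus
    b: frequency of the word in the reference corpus
    n_target, n_ref: total number of tokens in each corpus
    """
    c = n_target - a  # all other tokens in the target corpus
    d = n_ref - b     # all other tokens in the reference corpus
    n = n_target + n_ref
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def keyword_scores(target_tokens: Iterable[str],
                   ref_tokens: Iterable[str]) -> Dict[str, float]:
    """Score words that are relatively more frequent in the target corpus."""
    target = Counter(target_tokens)
    ref = Counter(ref_tokens)
    n_target = sum(target.values())
    n_ref = sum(ref.values())
    return {
        word: chi_square(freq, ref[word], n_target, n_ref)
        for word, freq in target.items()
        # cross-multiplied comparison of relative frequencies
        if freq * n_ref > ref[word] * n_target
    }
```

Sorting the resulting dictionary by score in descending order yields a ranked keyword list; a threshold on the score is an additional design choice that is left open here.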

da_ngrams.pl

This script extracts bigrams, trigrams and 4-grams of dialog acts (extract-das.pl) and counts the transition frequencies of SDS prompts to user answers for bigrams (get-probabilities.pl).
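
The two tasks can be sketched as follows. This is a Python illustration, not the code of extract-das.pl or get-probabilities.pl; the input format (a list of speaker and dialog act pairs) and the speaker labels "SDS" and "USER" are assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple


def ngrams(acts: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return all n-grams of a dialog act sequence, in order."""
    return [tuple(acts[i:i + n]) for i in range(len(acts) - n + 1)]


def transition_probabilities(
    dialog: Sequence[Tuple[str, str]]
) -> Dict[str, Dict[str, float]]:
    """Estimate P(user act | SDS prompt act) from adjacent turns.

    dialog: (speaker, dialog_act) pairs in dialog order.
    Only bigrams in which an SDS turn is directly followed
    by a user turn are counted.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for (speaker1, act1), (speaker2, act2) in zip(dialog, dialog[1:]):
        if speaker1 == "SDS" and speaker2 == "USER":
            counts[act1][act2] += 1
    return {
        prompt: {act: c / sum(followers.values())
                 for act, c in followers.items()}
        for prompt, followers in counts.items()
    }
```

Calling ngrams with n set to 2, 3 and 4 gives the bigrams, trigrams and 4-grams; the probabilities are plain relative frequencies without smoothing.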

system-lexica.pl

This script extracts system-specific lexica from the corpus.

annotate-domains.pl

This script produces an annotation file with general dialog information:

User type (lay/expert)
SDS info (name, year)
Domain info (content, goal)

The VOICE Awards Corpus

A Spoken Human-Machine Dialog Corpus