The following tools cover the processing steps that follow annotation, such as extracting system-specific lexica and n-grams, computing chi-square values, and constructing templates. The most important tools are listed and briefly described below.
xml_parser.pl
This script parses the NXT files and creates a corpus file in several steps.
- getting all information already available in the *.xml files
- putting disrupted SDS turns together
- inserting an empty user answer string where the user gave no input
- setting user and SDS turn id numbers
- putting disrupted user answers together
- extracting slots from alternative_questions and inserting self_correct tags for fillers
- ?
- marking up content of user and SDS utterances with the help of key words and rules
- constructing templates
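The turn-related steps above (joining disrupted turns, inserting empty user answers, numbering turns) can be sketched roughly as follows. This is a minimal Python illustration, not the logic of xml_parser.pl itself: the input format (tuples of speaker, turn key and text), the turn keys and the function name `build_turn_sequence` are all assumptions made for the example, and the sketch joins SDS and user fragments in one pass rather than in separate steps.

```python
def build_turn_sequence(fragments):
    """Join fragments, pad missing user answers, and number the turns.

    fragments: list of (speaker, key, text) tuples, where speaker is
    "sds" or "user" and key is a hypothetical label shared by all
    fragments of the same (possibly disrupted) turn.
    """
    # Step 1: join fragments that share (speaker, key), keeping first-seen order.
    joined = {}
    for speaker, key, text in fragments:
        joined.setdefault((speaker, key), []).append(text)
    turns = [(speaker, " ".join(parts)) for (speaker, _), parts in joined.items()]

    # Step 2: insert an empty user answer after every SDS turn
    # that is not directly followed by a user turn.
    padded = []
    for i, (speaker, text) in enumerate(turns):
        padded.append((speaker, text))
        follower = turns[i + 1][0] if i + 1 < len(turns) else None
        if speaker == "sds" and follower != "user":
            padded.append(("user", ""))

    # Step 3: number SDS and user turns separately.
    counters = {"sds": 0, "user": 0}
    result = []
    for speaker, text in padded:
        counters[speaker] += 1
        result.append({"speaker": speaker, "id": counters[speaker], "text": text})
    return result
```

With this layout every SDS turn ends up paired with exactly one user turn, which keeps later prompt-to-answer counting simple.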
chi_square.prl
Script for extraction of key words from transcriptions.
a) sds2chi-all: all SDS against Tiger
b) sds2chi-domains: chi-values for each class of both domains
The following files are created (or updated):
corpus-info: length of the corpora
residue-file: names of the individual comparison corpora
chisquare.log: logfile
parsefile: file created by the nxt_parser
all .lex and .lex.html files
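To illustrate how chi-square values can single out key words, the sketch below scores each word of a target corpus against a comparison corpus using the standard 2x2 contingency-table formulation. It is a Python illustration under assumptions: the source does not state which chi-square variant chi_square.prl uses, and the function names and the token-list input are hypothetical.

```python
from collections import Counter


def chi_square(freq_target, size_target, freq_ref, size_ref):
    """Pearson chi-square for one word in a 2x2 table (no continuity correction)."""
    a, b = freq_target, freq_ref      # occurrences of the word
    c, d = size_target - a, size_ref - b  # all other tokens
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def keyword_scores(target_tokens, ref_tokens):
    """Rank words that are relatively more frequent in the target corpus."""
    if not target_tokens or not ref_tokens:
        return []
    target, ref = Counter(target_tokens), Counter(ref_tokens)
    n_target, n_ref = len(target_tokens), len(ref_tokens)
    scores = {}
    for word, freq in target.items():
        # Keep only words overrepresented in the target corpus.
        if freq / n_target > ref[word] / n_ref:
            scores[word] = chi_square(freq, n_target, ref[word], n_ref)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

In this sketch the corpus lengths play the role of the values stored in corpus-info; a word with the same relative frequency in both corpora scores zero.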
da_ngrams.pl
This script extracts bigrams, trigrams and 4-grams of dialog acts (extract-das.pl) and counts the transition frequencies of SDS prompts to user answers for bigrams (get-probabilities.pl).
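The two parts of this script can be pictured with a short Python sketch: one function counts n-grams over sequences of dialog acts, the other turns prompt-to-answer pairs into relative transition frequencies. The input formats and function names are assumptions for illustration; they are not taken from extract-das.pl or get-probabilities.pl.

```python
from collections import Counter, defaultdict


def ngrams(sequence, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def count_ngrams(dialogs, sizes=(2, 3, 4)):
    """Count dialog-act n-grams per size; n-grams never cross dialog boundaries."""
    counts = {n: Counter() for n in sizes}
    for acts in dialogs:
        for n in sizes:
            counts[n].update(ngrams(acts, n))
    return counts


def transition_probabilities(pairs):
    """Relative frequency of each user act given the preceding SDS act.

    pairs: iterable of (sds_act, user_act) tuples.
    """
    freq = defaultdict(Counter)
    for sds_act, user_act in pairs:
        freq[sds_act][user_act] += 1
    return {
        sds_act: {act: count / sum(users.values()) for act, count in users.items()}
        for sds_act, users in freq.items()
    }
```

Counting per dialog avoids spurious n-grams that would span the end of one dialog and the start of the next.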
system-lexica.pl
This script extracts system-specific lexica from the corpus.
annotate-domains.pl
This script produces an annotation file with general dialog information.
- User type (lay/expert)
- SDS info (name, year)
- Domain info (content, goal)