
====================================================================
Quality Estimation features for WMT12 Quality Estimation Shared Task
eleftherios.avramidis@dfki.de, July 2012
====================================================================

    If you use these features, I would appreciate it if you cite:

        Avramidis et al. (2011) \cite{avramidis11:WMTeval} if you used the PCFG parsing features
        Avramidis (2012) \cite{avramidis:2012:WMT} if you used all the rest

    and, of course, the respective authors of the tools.
    See attached file avramidisfeatures_bibtex.bib
    
    Please observe the licenses of the tools when using the data.


Format description
==================

The annotation of the training set and test set of the shared task can be found in the files training.tab and test.tab respectively. The files contain one sample per line, with all generated features in tab-separated columns. There is an additional version of the files in .csv format, where the feature columns are separated with commas. An XML file is also given for convenience.
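For convenience, here is a minimal loading sketch using only the Python standard library; it assumes nothing beyond the layout described in this README (one sample per line, tab-separated columns, and the header lines described in the Heading section below).

    # Minimal sketch: load training.tab, assuming the layout described
    # in this README (tab-separated, three header lines, then samples).
    import csv

    with open("training.tab", encoding="utf-8") as f:
        rows = list(csv.reader(f, delimiter="\t"))

    header_lines = rows[:3]   # feature names, feature types, meta flags
    samples = rows[3:]        # one sample per line

    print(len(header_lines[0]), "columns,", len(samples), "samples")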

The quality score provided by the organizers, as well as the original source and target strings, are also included in the last columns of the files.

Heading:
- The first line of the files contains the feature names. 
- The second line of the files gives the type of each feature: 'string' for string features or 'c' for continuous ones. In these sets all numerical features are continuous.
- The third line has an 'm' for meta-features, i.e. string features that cannot be trained on.
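
As an illustration, a minimal sketch of selecting only the continuous, non-meta columns from these header lines (the feature names used in the toy rows below are made up, not taken from the released files):

    # Sketch: keep indices of continuous, non-meta features, based on
    # the three header lines described above.
    def trainable_columns(names, types, meta):
        return [i for i, name in enumerate(names)
                if types[i].strip() == "c" and meta[i].strip() != "m"]

    # toy header rows mimicking the documented layout
    names = ["src_l_tokens", "tgt-1_l_tokens", "source_string"]
    types = ["c", "c", "string"]
    meta  = ["", "", "m"]
    print(trainable_columns(names, types, meta))  # -> [0, 1]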



Features description
====================

In the tab/comma separated files, the feature names are defined on the first line of each file. 

Note: the number of features is not equal for the two files. Features that have zero values for all items do not get a column. This is an engineering oversight, so one may have to locate feature names that do not appear in the test set and filter them out of the training set.
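
One possible way to handle this, sketched below under the assumption that column names are unique and that a shared name identifies the same feature in both files:

    # Sketch: keep only the columns whose names appear in both files.
    import csv

    def read_tab(path):
        with open(path, encoding="utf-8") as f:
            rows = list(csv.reader(f, delimiter="\t"))
        return rows[0], rows[3:]   # feature names, data rows

    train_names, train_rows = read_tab("training.tab")
    test_names, test_rows = read_tab("test.tab")

    shared = [n for n in train_names if n in set(test_names)]
    train_idx = [train_names.index(n) for n in shared]
    test_idx = [test_names.index(n) for n in shared]

    train_aligned = [[row[i] for i in train_idx] for row in train_rows]
    test_aligned = [[row[i] for i in test_idx] for row in test_rows]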

Feature origin: "src_" or "tgt-1_": the feature name prefix, separated by an underscore, defines whether the feature has been produced by analyzing the source or the target, respectively. Comparison features that take both source and target into consideration have also been assigned to the target.

Ratios: for all features that appear in both source and target, their ratios are also given.

Feature family: This is defined by the second part of the feature name, which is surrounded by underscores. The feature set is organized in the following feature families:

Berkeley Parser
---------------
Prefixed with "_berkeley_" and "_berkley_": Sentence-level statistics of the PCFG parse for the sentence. It cointains the log likelihood, the confidence of the best parse, the average confidence of all parses and the number of n-best (berkeley-n) trees generated. The best parse in bracketed format is also included for reference purposes or further feature extraction. 

We used Berkeley Parser v1.1 \cite{petrov06:berkeleyParser,Petrov07improvedinference}. English is parsed with the provided grammar. We trained our own Spanish grammar based on AnCora \cite{TAUL08.35}.

Parse label-count/ratios
------------------------
Named as "_parse-LABEL" and "_parse-LABEL_ratio", these are counts of the basic node labels of the parse tree, as assigned by the Berkeley Parser. English and Spanish variations over labels names have been manually aligned and their ratios are also given. For the meaning of the labels refer to the respective grammars.
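
For illustration only, a rough sketch of how such label counts could be recomputed from a bracketed parse (the extraction actually used for the released features may differ in detail):

    # Sketch: count node labels in a bracketed parse; every "(LABEL"
    # opens a node, so POS tags are counted as well.
    import re
    from collections import Counter

    def parse_label_counts(bracketed):
        return Counter(re.findall(r"\(([^\s()]+)", bracketed))

    tree = "(S (NP (DT the) (NN cat)) (VP (VBZ sleeps)))"
    print(parse_label_counts(tree))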

languagetool.org
----------------
Prefixed with "_lt_" : "LanguageTool Style and Grammar Checker", an Open Source style and grammar proofreading software. For each sentence, we provide 
_lt_RULE_NAME: the occurences of each matching rule and
_lt_RULE_NAME_chars: the count of characters affected by each rule
_lt_errors: overall count of matching rule occurences per sentence
_lt_errors_chars: overall count of characters affected by matching rule occurences per sentence
Version 1.6 was used. 
For the exact description of each rule, look up its RULE_NAME at http://community.languagetool.org/rule/list
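
A small sketch of how the per-rule and overall counts relate to each other, assuming a list of (rule name, affected character count) matches has already been obtained for one sentence (the rule names and values below are invented):

    # Sketch: aggregate rule matches into the _lt_ features above.
    from collections import Counter

    # hypothetical matches for one sentence: (rule name, affected chars)
    matches = [("UPPERCASE_SENTENCE_START", 4),
               ("EN_A_VS_AN", 2),
               ("EN_A_VS_AN", 1)]

    rule_counts = Counter(rule for rule, _ in matches)   # _lt_RULE_NAME
    rule_chars = Counter()                               # _lt_RULE_NAME_chars
    for rule, n_chars in matches:
        rule_chars[rule] += n_chars

    lt_errors = sum(rule_counts.values())                # _lt_errors
    lt_errors_chars = sum(rule_chars.values())           # _lt_errors_chars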

Decoding scores
---------------
Named as "_d_SCORENAME" or "_d_SCORENAME_[avg|var|std]". This scores have been extracted from the decoding logs provided by the organizers. The overall scores of the best hypothesis are provided as given. For the intermediate scores seen in the generation of the best hypothesis, we provide their average (_avg), variance (_var) and standard deviation (_std). Further description for the meaning of these can be found in the Moses manual or summarized in my shared task paper. \cite{avramidis:2012:WMT}
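
For clarity, a short sketch of the aggregation, assuming a list of intermediate scores has already been read from the decoder log (the values below are made up, and whether population or sample variance was used is not documented here):

    # Sketch: aggregate intermediate decoder scores into _avg/_var/_std.
    from statistics import mean, pvariance, pstdev

    intermediate_scores = [-12.3, -11.8, -13.1, -12.6]   # hypothetical values

    score_avg = mean(intermediate_scores)       # _d_SCORENAME_avg
    score_var = pvariance(intermediate_scores)  # _d_SCORENAME_var
    score_std = pstdev(intermediate_scores)     # _d_SCORENAME_std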

Length
------
Prefixed with "_l_": count of tokens, characters (chars), average character count per token (avgchars)

Language Model
--------------
Prefixed with "_lm_": uni-gram, bi-gram, tri-gram and 5-gram probability, as well as count of unknown words (unk). We trained our own language model with SRILM \cite{stolcke02:srilm} based on the monolingual data of WMT11. 




Style and grammar features could not be included due to licensing issues. If you need further information please contact the authors.

The parallelization of the pipeline was organized with Ruffus \cite{Goodstadt:2010:RUF:1883328.1883347}.

Acknowledgments:
Thanks to Lukas Poustka for his engineering support on several parts of the annotation process.