Index of /~mapo02/hjerson

Name                      Last modified       Size

LICENCE                   29-Jun-2012 15:51   1k
example.cats              11-Aug-2011 16:12   1k
example.errors            11-Aug-2011 16:12   1k
example.html              11-Aug-2011 16:12   1k
example.hyp               11-Aug-2011 16:12   1k
example.hyp.base          11-Aug-2011 16:12   1k
example.hyp.pos           11-Aug-2011 16:12   1k
example.pos.cats          11-Aug-2011 16:12   1k
example.pos.html          11-Aug-2011 16:12   1k
example.ref               11-Aug-2011 16:12   1k
example.ref.base          11-Aug-2011 16:12   1k
example.ref.pos           11-Aug-2011 16:12   1k
example.senterrorrates    11-Aug-2011 16:12   1k
example.totalerrorrates   11-Aug-2011 16:12   1k
gpl.txt                   11-Aug-2011 17:10   34k
hjerson+.py               06-May-2015 18:55   30k
hjerson.py                26-Sep-2014 18:32   27k


######################################################
# Hjerson: A tool for automatic error classification #
######################################################

By: Maja Popovic <maja.popovic@dfki.de>,  August 2011



Hjerson detects translation errors using WER (word error rate)
alignment together with RPER (reference PER) and HPER (hypothesis PER)
errors, where PER is the position-independent error rate.
It is written in Python and requires Python 2 or Python 3.
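
To illustrate the idea behind RPER and HPER errors, here is a minimal
sketch that treats both sentences as bags of words and ignores word
positions. The function name per_errors and the example sentences are
hypothetical; this is not the code used in hjerson.py, which may
handle details differently.

```python
from collections import Counter


def per_errors(ref_words, hyp_words):
    """Return (rper_errors, hper_errors) as lists of words.

    A reference word counts as a reference-side PER error if it has no
    unmatched counterpart in the hypothesis, and vice versa.
    Word positions are ignored; only word counts matter.
    """
    ref_counts = Counter(ref_words)
    hyp_counts = Counter(hyp_words)
    # Counter subtraction keeps only positive counts.
    rper = list((ref_counts - hyp_counts).elements())
    hper = list((hyp_counts - ref_counts).elements())
    return rper, hper


ref = "the cat sat on the mat".split()
hyp = "the cat sits on mat".split()
rper, hper = per_errors(ref, hyp)
print(sorted(rper))  # reference words without a match in the hypothesis
print(sorted(hper))  # hypothesis words without a match in the reference
```

Here one "the" and "sat" remain unmatched on the reference side, and
"sits" remains unmatched on the hypothesis side.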

The following five error classes are supported:

- inflectional (morphological) errors, i.e. incorrect word forms
- reordering errors, i.e. incorrect word order
- missing words
- extra words
- lexical errors, i.e. incorrect lexical choice
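
One plausible way to combine the WER alignment with the PER errors
into these five classes is sketched below. The function
classify_word, its arguments and the exact rule order are assumptions
made for illustration; the rules actually implemented in hjerson.py
may differ.

```python
def classify_word(wer_op, full_form_per_error, base_form_per_error):
    """Assign one of the five error classes to a word, or None if correct.

    wer_op: "match", "sub", "del" (reference side) or "ins"
        (hypothesis side), taken from the WER alignment.
    full_form_per_error: True if the full word form is an RPER/HPER error.
    base_form_per_error: True if the base form is an RPER/HPER error.
    """
    if wer_op == "match":
        return None
    if not full_form_per_error:
        # The same word form exists on the other side, at another position.
        return "reordering"
    if not base_form_per_error:
        # The base form exists on the other side, but the word form differs.
        return "inflectional"
    if wer_op == "del":
        return "missing"
    if wer_op == "ins":
        return "extra"
    return "lexical"


print(classify_word("sub", True, False))
```

In this sketch, a substituted word whose base form is found on the
other side is labelled as an inflectional error.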

The option -h, --help outputs a description of the available command
line options.



The required inputs are:
- translation reference and hypothesis
- base forms of translation reference and hypothesis

If any additional word-level information is available (such as
POS tags), it can be incorporated as well in order to obtain
more details.

The required format of all inputs is tokenised (and preferably
true-cased) raw text containing one sentence per line.

In the case of multiple references, all available reference
sentences must be separated by the symbol #.
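
The input format described above can be read as in the following
sketch. The helper names read_sentences and check_parallel are
hypothetical, and the handling of whitespace around the # symbol is
an assumption; hjerson.py itself may parse its inputs differently.

```python
def read_sentences(lines, multi_ref=False):
    """Turn tokenised lines (one sentence per line) into token lists.

    With multi_ref=True each line is first split on '#', giving a list
    of references per sentence. This assumes that '#' never occurs as
    an ordinary token in the text.
    """
    sentences = []
    for line in lines:
        line = line.strip()
        if multi_ref:
            sentences.append([ref.split() for ref in line.split("#")])
        else:
            sentences.append(line.split())
    return sentences


def check_parallel(*inputs):
    """Raise ValueError unless all inputs have the same number of sentences."""
    lengths = [len(sentences) for sentences in inputs]
    if len(set(lengths)) > 1:
        raise ValueError("inputs differ in sentence count: %s" % lengths)


refs = read_sentences(["this is a test # this is one test", "hello"],
                      multi_ref=True)
hyps = read_sentences(["this is test", "hello"])
check_parallel(refs, hyps)
```

Checking the sentence counts before running the tool helps to catch
misaligned reference, hypothesis and base form files early.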



The default outputs are overall (document level) raw error counts and
error rates (counts normalised over the reference or hypothesis length)
for each of the five error classes.
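
The normalisation can be sketched as follows. The function
error_rates, the keys of the counts dictionary and the numbers are
made up for illustration; the rates are returned as fractions here,
and the output format of hjerson.py may differ.

```python
def error_rates(counts, ref_length, hyp_length):
    """Normalise raw error counts over the reference or hypothesis length.

    counts maps (side, error_class) to a raw count, where side is
    "ref" or "hyp" and selects the length used for normalisation.
    """
    rates = {}
    for (side, error_class), count in counts.items():
        length = ref_length if side == "ref" else hyp_length
        # Avoid division by zero for empty input.
        rates[(side, error_class)] = float(count) / length if length else 0.0
    return rates


# Hypothetical counts: 2 missing words in a reference of 8 words,
# 2 extra words in a hypothesis of 4 words.
counts = {("ref", "missing"): 2, ("hyp", "extra"): 2}
rates = error_rates(counts, ref_length=8, hyp_length=4)
print(rates[("ref", "missing")])  # 0.25
print(rates[("hyp", "extra")])    # 0.5
```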

Optional outputs are:
-s, --sent  sentence-errors.txt
    Raw error counts and error rates at the sentence level are written
    to "sentence-errors.txt".

-c, --cats  categories.txt
    Original reference and hypothesis words labelled with the
    corresponding error class are written to "categories.txt".

-m, --html  categories.html
    Original reference and hypothesis words with errors highlighted in
    colour are written to "categories.html" in HTML format.

If additional information is used, only "categories.txt" and
"categories.html" will be different.


Example for testing:
~~~~~~~~~~~~~~~~~~~~

You can try the tool on the given example using example.ref and
example.ref.base as reference inputs along with example.hyp and
example.hyp.base as hypothesis inputs.

If you want to try additional information, you can use reference and
hypothesis POS tags example.ref.pos and example.hyp.pos.

Then you can compare the obtained files with example.totalerrorrates,
example.senterrorrates, example.(pos.)cats and example.(pos.)html.
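
The comparison can be done with any diff tool. As one possibility,
here is a small sketch using the difflib module from the Python
standard library. The function names diff_lines and diff_files and
the file name my.senterrorrates are hypothetical. Note that this is
an exact line-by-line comparison, so harmless differences such as
number formatting would also be reported.

```python
import difflib
import io


def diff_lines(obtained, expected):
    """Return a unified diff between two lists of lines (empty if equal)."""
    return list(difflib.unified_diff(expected, obtained,
                                     fromfile="expected",
                                     tofile="obtained",
                                     lineterm=""))


def diff_files(obtained_path, expected_path):
    """Read two text files and return their unified diff as a list of lines."""
    with io.open(obtained_path, encoding="utf-8") as f:
        obtained = f.read().splitlines()
    with io.open(expected_path, encoding="utf-8") as f:
        expected = f.read().splitlines()
    return diff_lines(obtained, expected)


# Example with in-memory lines; for files one could call, for instance,
# diff_files("my.senterrorrates", "example.senterrorrates").
same = diff_lines(["a", "b"], ["a", "b"])
changed = diff_lines(["a", "x"], ["a", "b"])
print("\n".join(changed))
```

An empty result means that the obtained file matches the provided
example file.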