Starting from human annotations, we present a machine-learning strategy that
performs sentence-level preference ranking of alternative machine translations
of the same source. Rankings are decomposed into pairwise comparisons so that
they can be learned by binary classifiers, using black-box features derived
from linguistic analysis. When rankings are recomposed from the classifier's
pairwise decisions, weighting each decision by its classification probability
increases the correlation coefficient by 80%. We also
demonstrate several configurations of successful automatic ranking models.
The best configurations achieve a correlation with human judgments of 0.27,
measured by Kendall's tau. Although the method uses no reference translations,
this correlation is comparable to that achieved by state-of-the-art
reference-aware automatic evaluation metrics such as smoothed BLEU, METEOR,
and Levenshtein distance.
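The pipeline described above can be sketched in a minimal, self-contained form. The function names (`decompose`, `recompose`, `kendall_tau`) and the soft-Borda scoring used to recompose a ranking from probability-weighted pairwise decisions are illustrative assumptions, not the paper's actual implementation:

```python
import itertools

def decompose(ranking):
    """Decompose a ranking (item ids, best first) into pairwise
    training examples (a, b) meaning "a is preferred over b"."""
    return list(itertools.combinations(ranking, 2))

def recompose(pairwise_probs, items):
    """Recompose a full ranking from the classifier's pairwise decisions.
    pairwise_probs maps (a, b) -> P(a beats b); each item is scored by
    the sum of its win probabilities (a soft Borda count), so decisions
    are weighted by their classification probability."""
    score = {x: 0.0 for x in items}
    for (a, b), p in pairwise_probs.items():
        score[a] += p
        score[b] += 1.0 - p
    return sorted(items, key=lambda x: -score[x])

def kendall_tau(r1, r2):
    """Kendall's tau between two rankings of the same items:
    (concordant - discordant) pairs over total pairs."""
    pos1 = {x: i for i, x in enumerate(r1)}
    pos2 = {x: i for i, x in enumerate(r2)}
    conc = disc = 0
    for a, b in itertools.combinations(r1, 2):
        s = (pos1[a] - pos1[b]) * (pos2[a] - pos2[b])
        conc += s > 0
        disc += s < 0
    n = len(r1) * (len(r1) - 1) // 2
    return (conc - disc) / n
```

For example, with three hypothetical system outputs and classifier probabilities favoring `s2` over both others, `recompose` returns the ranking `["s2", "s1", "s3"]`, which can then be compared against the human ranking with `kendall_tau`.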