

TQ-AutoTest: Novel analytical quality measure confirms that DeepL is better than Google Translate

Vivien Macketanz; Aljoscha Burchardt; Hans Uszkoreit


Assessing translation quality is notoriously difficult. In Machine Translation (MT) research and the marketing built on it, simple automatic comparisons of system output against human reference translations are usually taken as approximate indications of quality. The automatic measures used, such as BLEU and Meteor, have major shortcomings: they do not provide reliable assessments of single sentences, they give no indication of the nature or severity of errors, and they do not allow meaningful comparisons across tools, languages, etc. Sometimes A-B tests involving humans are used to compare systems' quality, but these also provide no insight into the particular strengths and weaknesses of the systems.

In a series of EC-funded projects (QTLaunchPad, QTLeap, QT21), we devised a new method for assessing MT quality in close cooperation with GALA and GALA members (LSPs, translators, researchers). Our method can be classified as a source-driven approach, as opposed to the prevalent reference-based paradigm. The basic idea is to use a suite of segments exhibiting relevant (linguistic) phenomena and to assess a system's performance on each phenomenon individually. We understand linguistic phenomena in a broad way, ranging from punctuation to very specific constructions such as preposition stranding. Testing is semi-automatic, supported by the tool TQ-AutoTest as described below. The result is quantitative and qualitative insight into a system's performance, such as "system X gets 20% of the negations right, and 70% of the lexical ambiguities". In this white paper, we will briefly describe the test suite and the TQ-AutoTest tool and then showcase their application in a comparison of DeepL and Google Translate.
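The per-phenomenon scoring described above can be illustrated with a minimal sketch. This is not the TQ-AutoTest implementation; the function and data structures are illustrative assumptions. The idea is simply that each test segment is tagged with a phenomenon and judged pass/fail for a given system, and pass rates are then aggregated per phenomenon:

```python
# Illustrative sketch (not TQ-AutoTest code): aggregate per-phenomenon
# pass rates from a list of (phenomenon, passed) judgments.
from collections import defaultdict

def phenomenon_accuracy(judgments):
    """judgments: iterable of (phenomenon, passed) -> {phenomenon: pass rate}."""
    counts = defaultdict(lambda: [0, 0])  # phenomenon -> [passed, total]
    for phenomenon, passed in judgments:
        counts[phenomenon][0] += int(passed)
        counts[phenomenon][1] += 1
    return {p: passed / total for p, (passed, total) in counts.items()}

# Toy judgments for one hypothetical system:
judgments = [
    ("negation", True), ("negation", False), ("negation", False),
    ("lexical ambiguity", True), ("lexical ambiguity", True),
]
print(phenomenon_accuracy(judgments))
```

A report like "system X gets 20% of the negations right" then corresponds to reading off one entry of this per-phenomenon dictionary, one dictionary per system under comparison.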

