Automatic-Subtitling, Comparison on the Performance of Forced Alignment and Automatic Speech Recognition

Mino Lee Sasse, Stefan Schaffer, Aaron Ruß

In: 32. Konferenz Elektronische Sprachsignalverarbeitung. Elektronische Sprachsignalverarbeitung (ESSV-2021), March 3-5, Berlin/Virtual, Germany. TUDpress, Dresden, 2021.


This work focuses on the automatic generation of subtitles using tools that can be categorized as Forced Aligners (FAs) or Automatic Speech Recognizers (ASRs). We compared the performance of FA and ASR for the task of generating same-language subtitles. The primary motivation was a previous task: extracting sentence utterances from different audio files using word timestamps. Three tools were used in this work: aeneas [1], which is an FA, and Cerence [2] and Sonix [3], which are both ASRs. We conducted a technical evaluation and a subjective evaluation based on a case study. In this study, participants were presented with different stimuli, each using subtitles generated from the time information provided by one of the tools mentioned above. The data from the case study confirmed a higher performance of Cerence compared to aeneas.
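To illustrate how word timestamps can be turned into subtitles, the following is a minimal Python sketch. It is not the authors' pipeline: the `Word` structure, the rule of closing a cue at sentence-final punctuation, and the choice of SubRip (SRT) as the output format are all assumptions made for this example. The tools named above each return their timing data in their own formats, which would first have to be mapped to such a structure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Hypothetical word-level timing record (times in seconds)."""
    text: str
    start: float
    end: float


def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_cues(words: List[Word], sentence_end: str = ".!?") -> List[List[Word]]:
    """Group words into cues; a cue closes after sentence-final punctuation.

    Assumption: one sentence per subtitle cue. A real system would also
    limit cue length and duration.
    """
    cues: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        current.append(word)
        if word.text and word.text[-1] in sentence_end:
            cues.append(current)
            current = []
    if current:  # trailing words without closing punctuation
        cues.append(current)
    return cues


def cues_to_srt(cues: List[List[Word]]) -> str:
    """Render cues as SRT text: index, time range, then the cue text."""
    blocks = []
    for index, cue in enumerate(cues, start=1):
        start = format_timestamp(cue[0].start)
        end = format_timestamp(cue[-1].end)
        text = " ".join(w.text for w in cue)
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    sample = [
        Word("Hello", 0.0, 0.4),
        Word("world.", 0.5, 1.0),
        Word("Next", 1.5, 1.8),
        Word("one", 1.9, 2.25),
    ]
    print(cues_to_srt(words_to_cues(sample)))
```

In this sketch each cue spans from the start of its first word to the end of its last word, so the accuracy of the displayed subtitle timing depends directly on the accuracy of the word timestamps delivered by the FA or ASR tool.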



Deutsches Forschungszentrum für Künstliche Intelligenz
German Research Center for Artificial Intelligence