Prosodic and other Long-Term Features for Speaker Diarization

G. Friedland, O. Vinyals, Y. Huang, Christian Müller

In: IEEE Transactions on Speech and Audio Processing, vol. 17, no. 5, pages 985-993, 2009.


Speaker diarization is defined as the task of determining "who spoke when" given an audio track and no other prior knowledge of any kind. The following article shows how a state-of-the-art speaker diarization system can be improved by combining traditional short-term features (MFCCs) with prosodic and other long-term features. First, we present a framework to study the speaker discriminability of 70 different long-term features. Then, we show how the top-ranked long-term features can be combined with short-term features to increase the accuracy of speaker diarization. The results were measured on standardized datasets (NIST RT06 and RT07) and show a consistent improvement of about 30% relative in diarization error rate compared to the best system presented at the NIST evaluation in 2007.


Deutsches Forschungszentrum für Künstliche Intelligenz
German Research Center for Artificial Intelligence