Fusion Of Short Term And Long Term Features For Improved Speaker Diarization

G. Friedland; O. Vinyals; Y. Huang; Christian Müller

In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. International Conference on Acoustics, Speech and Signal Processing (ICASSP-09), April 19-24, Taipei, Taiwan, Province of China, IEEE, 2009.


The following article shows how a state-of-the-art speaker diarization system can be improved by combining traditional short-term features (MFCCs) with prosodic and other long-term features. First, we present a framework to study the speaker discriminability of 70 different long-term features. Then, we show how the top-ranked long-term features can be combined with short-term features to increase the accuracy of speaker diarization. The results were measured on standardized data sets (NIST RT) and show a consistent improvement of about 30% relative in diarization error rate compared to the best system presented at the NIST evaluation in 2007. This result was also verified on a wide set of meetings, which we call CombDev, that contains 21 meetings from previous evaluations. Since the prosodic and long-term features were selected using a diarization-independent speaker-discriminability study, we are confident that the same features are able to improve other systems that perform similar tasks.
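
The "30% relative" figure in the abstract refers to a reduction expressed as a fraction of the baseline error rate, not a drop in absolute percentage points. A minimal sketch of that arithmetic follows; the function name and the numeric values are hypothetical and chosen only for illustration, they are not results from the paper.

```python
def relative_der_reduction(baseline_der: float, new_der: float) -> float:
    """Return the reduction in diarization error rate as a fraction of the baseline."""
    if baseline_der <= 0:
        raise ValueError("baseline_der must be positive")
    return (baseline_der - new_der) / baseline_der


# Hypothetical error rates in percent; NOT figures reported in the paper.
baseline = 20.0
improved = 14.0

# An absolute drop of 6 points from a baseline of 20 is a 30% relative reduction.
print(f"{relative_der_reduction(baseline, improved):.0%} relative")
```

With these assumed values, a 6-point absolute drop corresponds to a 30% relative reduction, which is the sense in which the abstract reports its improvement.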


Further Links

friedland_et_al_2009_icassp.pdf (pdf, 199 KB)

Deutsches Forschungszentrum für Künstliche Intelligenz
German Research Center for Artificial Intelligence