Interactive, intelligent speech technologies are conquering our homes. In the Emonymous project, we aim to completely anonymize a speaker's identity without losing the emotional and linguistic content of the speech. Processing speech data in this privacy-preserving way also opens up enormous application potential.
The Speech and Language Technology (SLT) department contributes significant expertise in the areas of:
- Speech Synthesis, e.g. Voice Conversion (VC), Text-to-Speech (TTS), Voice Cloning, Zero-Shot Learning.
- Speech Recognition, e.g. Automatic Speech Recognition (ASR), Multi-Lingual Speech Recognition.
- Speaker Recognition, e.g. Automatic Speaker Recognition and Verification (ASV), Multi-Lingual Speaker Recognition.
- Emotion Recognition from Speech, Text, Video/Images, and Multimodal Input, e.g. Transformer-based Models; Acoustic, Linguistic (Language Models), and Visual Models (Facial Expression, Landmarks).
- Crowd-based AI Support, e.g. automated, online-orchestrated hybrid AI+human workflows that combine crowdsourcing and expert sourcing for high-quality data acquisition.
- AI in the area of pre-trained language models, transfer learning, cross-lingual learning, continuous learning, frugal AI.
Focus: Driven by ever-advancing AI, interactive and intelligent voice assistants are becoming part of more and more areas of everyday life. However, privacy concerns prevent them from being used beyond the home. In particular, the large amount of data collected makes it possible to identify a speaker by voice, which prevents effective use of these technologies in sensitive areas such as the health sector or learning support. For many applications, however, it is only necessary to know what was said, not who said it. Here, anonymizing the speaker's identity can prevent identification in (cloud-based) downstream processing. At the same time, how something is said conveys further indicators (e.g. emotions, personality, proficiency) that are needed to respond adequately to the individual needs of the user and thus improve the interaction.
The goal of this joint project is to completely anonymize the speaker identity while preserving the emotional and speech content-related information as far as possible. For this purpose, we rely on the latest AI developments such as Voice Conversion and Differentiable Digital Signal Processing (DDSP).
In combination with a newly developed differentiable similarity measure, it is possible to derive indicators for the success of anonymization. The developed techniques make it possible to advance diverse innovative applications while preserving speaker anonymity, and they strengthen applied science as well as Germany as a business location.
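To illustrate the idea of a similarity-based indicator, the following is a minimal sketch that compares two speaker embeddings with cosine similarity. It is not the measure developed in the project, which is not specified here. The function names, the mapping to a score in [0, 1], and the embedding values are all hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def anonymization_indicator(original: Sequence[float],
                            anonymized: Sequence[float]) -> float:
    """Hypothetical indicator: maps similarity in [-1, 1] to a score in [0, 1].

    Higher values mean the anonymized voice is less similar to the original.
    """
    return (1.0 - cosine_similarity(original, anonymized)) / 2.0


# Made-up speaker embeddings; a real system would obtain these from a
# speaker recognition model (an assumption, not stated in the text above).
original_voice = [1.0, 0.0, 0.0]
unchanged_voice = [1.0, 0.0, 0.0]    # identical embedding
converted_voice = [0.0, 1.0, 0.0]    # orthogonal embedding

score_unchanged = anonymization_indicator(original_voice, unchanged_voice)
score_converted = anonymization_indicator(original_voice, converted_voice)
```

Cosine similarity is used here because it is smooth in its inputs wherever both vectors are non-zero, so a measure of this kind could in principle serve as a training signal as well as an evaluation indicator.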
Lead: Dr. Tim Polzehl
Dr. Tim Polzehl leads the AI-based developments in the area of speech-based applications in the Speech and Language Technology department at DFKI. In addition, he leads the area of "Next Generation Crowdsourcing and Open Data" and is an active member of the "Speech Technology" group of the Quality and Usability Labs (QU-Labs) at the Technical University of Berlin.
Profile QU-Labs TU-Berlin: https://www.tu.berlin/index.php?id=29499/
Technische Universität Berlin, Quality and Usability Lab
Otto-von-Guericke-Universität Magdeburg, Fachgebiet Mobile Dialogsysteme