

Voice Privacy - Leveraging Multi-Scale Blocks with ECAPA-TDNN SE-Res2NeXt Extension for Speaker Anonymization

Razieh Khamsehashari; Yamini Sinha; Jan Hintz; Suhita Ghosh; Tim Polzehl; Carlos Franzreb; Sebastian Stober; Ingo Siegert
In: Proceedings of Interspeech 2022. Conference in the Annual Series of Interspeech Events (INTERSPEECH-2022), September 18-24, Incheon, Republic of Korea, ISCA, 2022.


This paper presents ongoing work on voice anonymization, with the aim of securely anonymizing a speaker's identity in a hotline call scenario. Our hotline seeks to provide help against child sexual abuse in Germany through remote assessment, treatment, and prevention. The work originates from a joint contribution to the VoicePrivacy Challenge 2022 and the Symposium on Security and Privacy in Speech Communication 2022. Building on an in-depth analysis of the results of the first VoicePrivacy Challenge in 2020, the current experiments aim to improve the robustness of two distinct components of the challenge baseline. First, we analyze ASR embeddings in order to obtain a more precise and resistant representation of the source speech used in the challenge baseline GAN. First experiments using wav2vec show promising results. Second, to ease the modeling and matching of source and target speaker characteristics, we propose replacing the baseline x-vector speaker identity features with the more robust ECAPA-TDNN embedding, in order to leverage its higher-resolution multi-scale architecture. Further, to improve on ECAPA-TDNN, we propose extending the model architecture by integrating SE-Res2NeXt units. The SE-Res2Net block used in ECAPA-TDNN creates hierarchical residual-like connections within a single residual block, which allows it to represent features at multiple scales; this expands the range of receptive fields for each network layer and captures multi-scale features at a finer level. We expect SE-Res2NeXt, a more recent CNN building block that likewise represents features at various scales, to perform better than SE-Res2Net. Ultimately, with a more precise speaker identity embedding, we expect to achieve improvements in future anonymization across various application cases.
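The hierarchical residual-like connections mentioned in the abstract can be illustrated with a small sketch. This is not the authors' model: it is a pure-Python toy, assuming one fixed 1D kernel shared by all branches in place of learned (dilated or grouped) convolutions, and it omits the squeeze-and-excitation step. The names `conv1d_same`, `res2_split`, and `support_widths` are hypothetical helpers introduced only for this illustration. It shows how later channel groups in a Res2Net-style split see a progressively wider temporal context.

```python
def conv1d_same(signal, kernel):
    """Zero-padded 1D correlation that keeps the input length (odd kernel assumed)."""
    half = len(kernel) // 2
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t + k - half
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out


def res2_split(groups, kernel):
    """Res2Net-style hierarchy over channel groups x1..xs.

    y1 = x1, y2 = K(x2), yi = K(xi + y_{i-1}) for i > 2.
    Assumption: one shared, fixed kernel K instead of a learned filter per group.
    """
    outputs = [list(groups[0])]  # first group is passed through unchanged
    prev = None
    for x in groups[1:]:
        inp = x if prev is None else [a + b for a, b in zip(x, prev)]
        prev = conv1d_same(inp, kernel)
        outputs.append(prev)
    return outputs


def support_widths(outputs):
    """Number of non-zero time steps per group (the visible receptive field)."""
    return [sum(1 for v in y if v != 0) for y in outputs]


if __name__ == "__main__":
    # Feed a unit impulse into each of 4 groups and filter with a width-3 kernel.
    impulse = [0.0] * 11
    impulse[5] = 1.0
    outs = res2_split([impulse[:] for _ in range(4)], [1.0, 1.0, 1.0])
    print(support_widths(outs))  # prints [1, 3, 5, 7]
```

With a width-3 kernel and four groups, the impulse spreads over 1, 3, 5, and 7 time steps, so a single block mixes several receptive-field sizes. A Res2NeXt-style variant would, roughly speaking, replace each per-group filter with a grouped convolution; that detail is not modeled here.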