In the real world, speaker recognition systems usually suffer serious performance degradation due to the domain mismatch between training and test conditions. To alleviate the harmful effect of domain shift, unsupervised domain adaptation methods have been introduced to learn domain-invariant speaker representations, but they focus on the single-source-to-single-target adaptation setting. In practice, however, labeled speaker data are usually collected from multiple sources, such as different languages, genres, and devices, and single-domain adaptation methods cannot handle this more complex multiple-domain mismatch. To address this issue, we propose a multiple-domain adaptation framework named CentriForce that extracts domain-invariant speaker representations for speaker recognition. Unlike previous methods, CentriForce learns multiple domain-related speaker representation spaces. To mitigate the multiple-domain mismatch, it reduces the Wasserstein distance between each pair of source and target domains in their domain-related representation space, while using the target domain as an anchor point to draw all source domains closer to one another. In our experiments, CentriForce achieves the best performance on most of the 16 challenging adaptation tasks, compared with other competing adaptation methods. An ablation study and representation visualizations further demonstrate its effectiveness in learning domain-invariant speaker embeddings.
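To make the alignment objective concrete, the sketch below shows one plausible realization in PyTorch: a sliced-Wasserstein proxy that pulls each source domain's embeddings toward the shared target anchor, which implicitly draws the source domains toward one another. Everything here is an assumption for illustration, not the paper's implementation: the function names, the sliced approximation of the Wasserstein distance, the equal batch sizes, and the simplification of collapsing the per-pair domain-related representation spaces into a single shared embedding space.

```python
# Hedged sketch of a multi-source Wasserstein alignment loss.
# Assumptions (not from the paper): sliced-Wasserstein approximation,
# equal batch sizes per domain, one shared embedding space.
import torch


def sliced_wasserstein(x: torch.Tensor, y: torch.Tensor,
                       n_projections: int = 64) -> torch.Tensor:
    """Approximate the Wasserstein-1 distance between two embedding
    batches by averaging 1-D Wasserstein distances over random
    projection directions (batches must have equal size)."""
    d = x.size(1)
    # Random unit directions, shared by both batches.
    theta = torch.randn(d, n_projections, device=x.device)
    theta = theta / theta.norm(dim=0, keepdim=True)
    # Project onto each direction, sort the 1-D marginals, and
    # compare order statistics (the closed-form 1-D W1 distance).
    x_proj, _ = torch.sort(x @ theta, dim=0)
    y_proj, _ = torch.sort(y @ theta, dim=0)
    return (x_proj - y_proj).abs().mean()


def alignment_loss(source_batches: list[torch.Tensor],
                   target_batch: torch.Tensor) -> torch.Tensor:
    """Reduce the distance between each source domain and the target
    anchor; averaging over sources aligns all pairs jointly."""
    losses = [sliced_wasserstein(s, target_batch) for s in source_batches]
    return torch.stack(losses).mean()


if __name__ == "__main__":
    # Toy usage: three labeled source domains, one unlabeled target,
    # with 192-dimensional speaker embeddings.
    sources = [torch.randn(32, 192) for _ in range(3)]
    target = torch.randn(32, 192)
    print(alignment_loss(sources, target))
```

In a full training loop, a loss of this shape would typically be added to the speaker-classification loss on the labeled source data, so the encoder learns representations that are both discriminative and aligned across domains.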