Contrastive Learning for improving End-to-end Speaker Verification


Speaker verification involves examining a speech signal to accept or reject a speaker's identity claim. Deep neural networks are a successful class of complex non-linear models for learning unique and invariant features of data. They have been employed in speech recognition tasks and have shown potential for speaker recognition as well. However, overfitting remains a problem that limits model performance. In this study, we apply contrastive learning to speaker verification tasks to address the robustness problem. In addition, we introduce a domain adaptive loss for these tasks. Experimental results and an ablation study indicate that our proposed model significantly outperforms various end-to-end baseline methods, including d-vector approaches, deep-speaker, and the generalized end-to-end model, by at least 10% relative, for text-dependent speaker verification on a company's internal text-dependent voice command dataset.

International Joint Conference on Neural Networks