Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion

SiCheng Yang, Methawee Tantrawenith, Haolin Zhuang, Zhiyong Wu, Aolan Sun, Jianzong Wang, Ning Cheng, Huaizhen Tang, Xintao Zhao, Jie Wang, Helen Meng

September 2022

Framework of the proposed model

Abstract

One-shot voice conversion (VC), which uses only a single utterance from the target speaker as reference, has become a new research direction. Existing works generally disentangle timbre, while information about pitch, rhythm and content remains mixed together. To perform one-shot VC effectively while further disentangling these speech components, we employ random resampling for the pitch and content encoders, and we use the variational contrastive log-ratio upper bound of mutual information together with gradient reversal layer based adversarial mutual information learning, so that during training each part of the latent space contains only its intended representation. Experiments on the VCTK dataset show that the model achieves state-of-the-art one-shot VC performance in terms of naturalness and intelligibility of the converted speech. In addition, speech representation disentanglement lets us transfer timbre, pitch and rhythm style separately in one-shot VC. Our code, pre-trained models and demo are available at https://im1eon.github.io/IS2022-SRDVC/.
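As an illustration of the random resampling mentioned in the abstract, the following is a minimal sketch and not the released implementation. It assumes the input is a one-dimensional list of frame values, cuts it into segments of random length, and stretches or compresses each segment by a random factor using linear interpolation, which perturbs rhythm while leaving the order of the content intact. The function names, the segment-length range and the factor range are hypothetical choices made for this example.

```python
import random


def resample_segment(segment, factor):
    """Linearly interpolate `segment` to round(len * factor) frames (at least 1)."""
    n_in = len(segment)
    n_out = max(1, round(n_in * factor))
    if n_in == 1 or n_out == 1:
        # Nothing to interpolate between: repeat the first frame.
        return [segment[0]] * n_out
    out = []
    for k in range(n_out):
        # Map output index k onto a fractional position in the input segment.
        pos = k * (n_in - 1) / (n_out - 1)
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        frac = pos - lo
        out.append((1 - frac) * segment[lo] + frac * segment[hi])
    return out


def random_resample(frames, rng, min_len=4, max_len=8,
                    min_factor=0.5, max_factor=1.5):
    """Split `frames` into random-length segments and resample each one.

    All default ranges are assumptions for illustration only.
    """
    out = []
    start = 0
    while start < len(frames):
        seg_len = rng.randint(min_len, max_len)      # random segment length
        factor = rng.uniform(min_factor, max_factor)  # random stretch factor
        out.extend(resample_segment(frames[start:start + seg_len], factor))
        start += seg_len
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    frames = [float(i) for i in range(40)]
    warped = random_resample(frames, rng)
    print(len(frames), len(warped))
```

Because every segment is warped independently, the output length generally differs from the input length, so an encoder fed with such a signal cannot rely on the original timing.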

Publication

In 23rd Annual Conference of the International Speech Communication Association

Voice Conversion Audio

Aolan Sun

Researcher

Jianzong Wang

Honorary Director

Huaizhen Tang

University of Science and Technology of China