EAD-VC: Enhancing Speech Auto-Disentanglement for Voice Conversion with IFUB Estimator and Joint Text-Guided Consistent Learning

Ziqi Liang, Jianzong Wang, Xulong Zhang, Yong Zhang, Ning Cheng, Jing Xiao

March 2024

Framework of EAD-VC

Abstract

Using unsupervised learning to disentangle speech into content, rhythm, pitch, and timbre for voice conversion has become a hot research topic. Existing works generally take into account disentangling speech components through human-crafted bottleneck features which can not achieve sufficient information disentangling, while pitch and rhythm may still be mixed together. There is a risk of information overlap in the disentangling process which results in less speech naturalness. To overcome such limits, we propose a two-stage model to disentangle speech representations in a self-supervised manner without a human-crafted bottleneck design, which uses the Mutual Information (MI) with the designed upper bound estimator (IFUB) to separate overlapping information between speech components. Moreover, we design a Joint Text-Guided Consistent (TGC) module to guide the extraction of speech content and eliminate timbre leakage issues. Experiments show that our model can achieve a better performance than the baseline, regarding disentanglement effectiveness, speech naturalness, and similarity. Audio samples can be found at this https URL.

Type

Publication

In 2024 International Joint Conference on Neural Networks

Click the Cite button above to demo the feature to enable visitors to import publication metadata into their reference management software.

Voice Conversion Audio

EAD-VC: Enhancing Speech Auto-Disentanglement for Voice Conversion with IFUB Estimator and Joint Text-Guided Consistent Learning

Abstract

Ziqi Liang

University of Science and Technology of China

Jianzong Wang

Honorary Director

Xulong Zhang

Executive Director