VQ-CL: Learning Disentangled Speech Representations with Contrastive Learning and Vector Quantization

Huaizhen Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

April 2023

Framework of VQ-CL

Abstract

Voice conversion (VC) refers to converting the voice characteristics of an utterance so that it sounds as if it were spoken by another person. Recently, a growing number of studies have focused on disentanglement-based VC, which separates timbre from linguistic content in an audio signal in order to perform the conversion effectively. However, it is still challenging to extract phoneme-level features from frame-level hidden representations. This paper proposes VQ-CL, a novel zero-shot voice conversion framework that uses contrastive learning and vector quantization to pull the frame-level hidden features closer to phoneme-level linguistic information. All objective and subjective experimental results show that VQ-CL outperforms previous studies in separating content from voice characteristics, which improves the sound quality of the generated speech.
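
The two building blocks named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`vector_quantize`, `info_nce`), the array shapes, the temperature value, and the choice of pairing each frame with its own quantized code as the positive example are all assumptions made for illustration. The sketch shows nearest-neighbour codebook lookup (vector quantization) and a standard InfoNCE-style contrastive loss.

```python
import numpy as np

def vector_quantize(frames, codebook):
    """Replace each frame vector with its nearest codebook entry (Euclidean)."""
    # Squared distances between every frame (T, D) and every code (K, D) -> (T, K)
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return codebook[indices], indices

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss: row i of `anchors` should match row i of `positives`;
    all other rows of `positives` serve as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                       # cosine similarities, (T, T)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Hypothetical toy data: 6 frames of dimension 4, codebook with 3 entries.
rng = np.random.default_rng(0)
frames = rng.normal(size=(6, 4))
codebook = rng.normal(size=(3, 4))

quantized, indices = vector_quantize(frames, codebook)
# Assumed pairing for illustration: each frame is pulled toward its own code.
loss = info_nce(frames, quantized)
```

In this toy setup, frames that share a code also share a positive, so a real system would need a more careful choice of negatives; the paper itself should be consulted for how VQ-CL constructs its contrastive pairs.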

Publication

In 2023 IEEE International Conference on Acoustics, Speech and Signal Processing

Voice Conversion Audio

Huaizhen Tang

University of Science and Technology of China

Xulong Zhang

Executive Director

Jianzong Wang

Honorary Director