Adapitch: Adaption Multi-Speaker Text-to-Speech Conditioned on Pitch Disentangling with Untranscribed Data

Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

December 2022

The overview of the proposed text-to-speech method

Abstract

In this paper, we propose Adapitch, a multi-speaker TTS method that adapts the supervised module using untranscribed data. We design two self-supervised modules that train the text encoder and the mel decoder separately on untranscribed data to enhance the text and mel representations. To better handle the prosody information in synthesized speech, a supervised TTS module is conditioned on disentangled pitch, text, and speaker content. Training is split into two phases: the text encoder and mel decoder are first pretrained in unsupervised mode and then fixed, after which the disentangling TTS module is trained in supervised mode. Experimental results show that Adapitch achieves much better quality than the baseline methods.
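
The two-phase schedule described in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the `Module` class, the function names, and the batch counts are all hypothetical, and "training" is reduced to counting simulated updates so that the freezing behaviour is easy to see.

```python
from dataclasses import dataclass


@dataclass
class Module:
    """Hypothetical stand-in for a network component; it only counts updates."""
    name: str
    trainable: bool = True
    updates: int = 0

    def step(self) -> None:
        # Apply one simulated optimizer update unless the module is frozen.
        if self.trainable:
            self.updates += 1


def pretrain(module: Module, num_batches: int) -> None:
    """Phase 1: self-supervised training of a single module on unlabeled batches."""
    for _ in range(num_batches):
        module.step()


def train_supervised(frozen, trainable, num_batches: int) -> None:
    """Phase 2: fix the pretrained modules and update only the supervised ones."""
    for m in frozen:
        m.trainable = False
    for _ in range(num_batches):
        for m in list(frozen) + list(trainable):
            m.step()  # frozen modules ignore the update


def run_schedule(pretrain_batches: int = 3, supervised_batches: int = 5) -> dict:
    """Run both phases and report how many updates each module received."""
    text_encoder = Module("text_encoder")
    mel_decoder = Module("mel_decoder")
    # Assumed name for the supervised, disentanglement-conditioned TTS part.
    tts_module = Module("supervised_tts")

    # Phase 1: the two modules are pretrained separately.
    pretrain(text_encoder, pretrain_batches)
    pretrain(mel_decoder, pretrain_batches)

    # Phase 2: pretrained modules stay fixed while the TTS module is trained.
    train_supervised([text_encoder, mel_decoder], [tts_module], supervised_batches)
    return {m.name: m.updates for m in (text_encoder, mel_decoder, tts_module)}


if __name__ == "__main__":
    print(run_schedule())
```

In this sketch the pretrained modules stop accumulating updates once the second phase begins, which mirrors the "pretrained and fixed" wording of the abstract; a real system would replace the counters with actual model parameters and losses.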

Publication

In 18th International Conference on Mobility, Sensing and Networking

TTS Audio

Xulong Zhang

Executive Director

Jianzong Wang

Honorary Director