SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model

Jianzong Wang, Xulong Zhang, Haobin Tang, Aolan Sun, Ning Cheng, Jing Xiao

April 2023

The architecture of speech representation learning

Abstract

In recent text-to-speech (TTS) systems, a neural vocoder often generates speech samples by conditioning solely on acoustic features predicted by an acoustic model. However, the predicted acoustic features are always distorted relative to the ground truth, especially in the common case of poor acoustic modeling caused by low-quality training data. To overcome this limitation, we propose a Self-supervised learning framework to learn an Anti-distortion acoustic Representation (SAR) that replaces human-crafted acoustic features by introducing distortion prior to an auto-encoder pre-training process. Both objective and subjective evaluations show that the acoustic representation learned by the proposed framework is more resistant to distortion than the most commonly used mel-spectrogram.

Publication

In 2023 IEEE International Joint Conference on Neural Networks (IJCNN)

TTS

Jianzong Wang

Honorary Director

Xulong Zhang

Executive Director

Haobin Tang

University of Science and Technology of China

Aolan Sun

Engineer

Ning Cheng

Researcher