SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model

The architecture of speech representation learning

Abstract

In recent Text-to-Speech (TTS) systems, a neural vocoder often generates speech samples conditioned solely on acoustic features predicted by an acoustic model. However, the predicted acoustic features always contain distortions relative to the ground truth, especially in the common case of poor acoustic modeling caused by low-quality training data. To overcome this limitation, we propose a Self-supervised learning framework to learn an Anti-distortion acoustic Representation (SAR) to replace human-crafted acoustic features by introducing distortion prior to an auto-encoder pre-training process. Both objective and subjective evaluations show that the acoustic representation learned by the proposed framework is more robust to distortion than the most commonly used mel-spectrogram.

Publication
In 2023 IEEE International Joint Conference on Neural Networks
Haobin Tang
University of Science and Technology of China
Aolan Sun
Engineer