MelGlow: Efficient Waveform Generative Network Based On Location-Variable Convolution

Zhen Zeng, Jianzong Wang, Ning Cheng, Jing Xiao

January 2021

The architecture

Abstract

Recent neural vocoders usually use a WaveNet-like network to capture the long-term dependencies of the waveform, but a large number of parameters are required to obtain good modeling capabilities. In this paper, an efficient network, named location-variable convolution, is proposed to model the dependencies of waveforms. Unlike WaveNet, which uses unified convolution kernels to capture the dependencies of arbitrary waveforms, location-variable convolution uses a kernel predictor to generate multiple sets of convolution kernels from the mel-spectrum, and each set of kernels is used to perform convolution operations on its associated waveform interval. Combining WaveGlow and location-variable convolutions, an efficient vocoder, named MelGlow, is designed. Experiments on the LJSpeech dataset show that MelGlow achieves better performance than WaveGlow at small model sizes, which verifies the effectiveness and potential optimization space of location-variable convolutions.
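
The core idea in the abstract, one set of kernels per waveform interval, predicted from the conditioning mel-spectrum, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `predict_kernels` and `location_variable_conv`, the single-channel setup, and the use of one linear layer as the kernel predictor are all assumptions made for illustration. The sketch also assumes that each mel frame corresponds to exactly `hop` waveform samples.

```python
import numpy as np


def predict_kernels(mel, weight, bias):
    """Toy kernel predictor (assumption): one linear layer per mel frame.

    mel:    (n_frames, n_mels) conditioning features
    weight: (n_mels, kernel_size), bias: (kernel_size,)
    Returns one 1-D kernel per frame, shape (n_frames, kernel_size).
    """
    return mel @ weight + bias


def location_variable_conv(x, kernels, hop):
    """Filter the i-th interval of x (hop samples) with kernels[i].

    x:       (n_frames * hop,) single-channel waveform
    kernels: (n_frames, kernel_size), kernel_size must be odd
    Returns an array with the same length as x.
    """
    n_frames, ksize = kernels.shape
    assert x.shape[0] == n_frames * hop, "waveform length must be n_frames * hop"
    assert ksize % 2 == 1, "odd kernel size keeps the output aligned"
    pad = ksize // 2
    xp = np.pad(x, pad)  # zero padding at both ends
    y = np.empty(x.shape[0], dtype=float)
    for i in range(n_frames):
        start = i * hop
        # The interval plus `pad` samples of context on each side.
        segment = xp[start:start + hop + 2 * pad]
        # Cross-correlation, as in deep-learning "convolution" layers.
        y[start:start + hop] = np.correlate(segment, kernels[i], mode="valid")
    return y
```

When every interval receives the same kernel, this reduces to an ordinary convolution with shared weights, which is the WaveNet-style case the abstract contrasts against. A full vocoder layer would additionally use multiple channels, dilation, and a nonlinearity, none of which are shown here.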

Publication

In 2021 IEEE Spoken Language Technology Workshop

TTS Audio

Zhen Zeng

Researcher

Jianzong Wang

Honorary Director