Audio

A Language Model Based Pseudo-Sample Deliberation for Semi-supervised Speech Recognition

End-to-end modeling requires tremendous amounts of transcribed speech to achieve an automatic speech recognition (ASR) model with high …

Cheng Yi, Jianzong Wang, Ning Cheng, Shiyu Zhou, Bo Xu

CACnet: Cube Attentional CNN for Automatic Speech Recognition

End-to-end models have been widely used in Automatic Speech Recognition (ASR). Convolutional Neural Networks (CNNs) can effectively use …

Nan Zhang, Jianzong Wang, Wenqi Wei, Xiaoyang Qu, Ning Cheng, Jing Xiao

Loss Prediction: End-to-End Active Learning Approach For Speech Recognition

End-to-end speech recognition systems usually require huge amounts of labeling resources, while annotating the speech data is …

Jian Luo, Jianzong Wang, Ning Cheng, Jing Xiao

Transfer Ability of Monolingual Wav2vec2.0 for Low-resource Speech Recognition

Recently, there are several domains that have their own feature extractors, such as ResNet, BERT, and GPT-x, which are widely used for …

Cheng Yi, Jianzong Wang, Ning Cheng, Shiyu Zhou, Bo Xu

Singer Identification Using Deep Timbre Feature Learning with KNN-NET

In this paper, we study the issue of automatic singer identification (SID) in popular music recordings, which aims to recognize who …

Xulong Zhang, Jiale Qian, Yi Yu, Yifu Sun, Wei Li

Vocal Melody Extraction via HRNet-Based Singing Voice Separation and Encoder-Decoder-Based F0 Estimation

Vocal melody extraction is an important and challenging task in music information retrieval. One main difficulty is that most of the …

Yongwei Gao, Xulong Zhang, Wei Li

GraphPB: Graphical Representations of Prosody Boundary in Speech Synthesis

This paper introduces a graphical representation approach of prosody boundary (GraphPB) in the task of Chinese speech synthesis, …

Aolan Sun, Jianzong Wang, Ning Cheng, Huayi Peng, Zhen Zeng, Lingwei Kong, Jing Xiao

MelGlow: Efficient Waveform Generative Network Based On Location-Variable Convolution

Recent neural vocoders usually use a WaveNet-like network to capture the long-term dependencies of the waveform, but a large number of …

Zhen Zeng, Jianzong Wang, Ning Cheng, Jing Xiao

Multi-Quartznet: Multi-Resolution Convolution for Speech Recognition with Multi-Layer Feature Fusion

In this paper, we propose an end-to-end speech recognition network based on NVIDIA's previous QuartzNet [1] model. We try to …

Jian Luo, Jianzong Wang, Ning Cheng, Guilin Jiang, Jing Xiao
