Ning Cheng

He is an accomplished researcher in speech technology and natural language processing. With over a decade of experience, he has made substantial contributions to speech recognition, speech synthesis, and natural language processing. He earned his Bachelor’s degree in Mathematics and Applied Mathematics from Beijing University of Science and Technology in 2003, followed by a Master’s degree in Systems Engineering from the same institution in April 2006. He completed his Ph.D. in Pattern Recognition and Intelligent Systems at the Graduate University of the Chinese Academy of Sciences in July 2009. He began his career in July 2009 as an Assistant Researcher at the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences. In September 2011, he joined the Search Technology Center at Microsoft (China) Co., Ltd. as an Associate Researcher. In 2015, he moved to the Institute of Software, Chinese Academy of Sciences, as a Senior Engineer. In September 2016, he joined PAT (Shenzhen) Co., Ltd. as an Associate Researcher.

He has authored more than 80 academic papers presented at top international conferences in the field of speech technology. His work has also appeared in Chinese journals such as “China Science,” “Journal of Electronics,” and “Acta Automatica Sinica.” He has also contributed to research projects funded by the National 973 Program, the National 863 Program, and the National Natural Science Foundation of China, and has served as the principal investigator for a research project supported by the Guangdong Provincial Natural Science Foundation.

  • TTS
  • ASR
  • NLP
  • Voice Conversion
  • Artificial Intelligence


  1. Medical Speech Symptoms Classification via Disentangled Representation (2024) In CSCWD2024 (CCF-C)
  2. EmoTalker: Emotionally Editable Talking Face Generation via Diffusion Model (2024) In ICASSP2024 (CCF-B)
  3. ED-TTS: Multi-Scale Emotion Modeling Using Cross-Domain Emotion Diarization for Emotional Speech Synthesis (2024) In ICASSP2024 (CCF-B)
  4. Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval (2024) In ICASSP2024 (CCF-B)
  5. Leveraging Biases in Large Language Models: bias-kNN for Effective Few-Shot Learning (2024) In ICASSP2024 (CCF-B)
  6. Research on Audio Model Generation Technology Based on Hierarchical Federated Framework (2024) In CAAI TIT
  7. On the Calibration and Uncertainty with Pólya-Gamma Augmentation for Dialog Retrieval Models (2023) In AAAI2023 (CCF-A)
  8. PMVC: Data Augmentation-Based Prosody Modeling for Expressive Voice Conversion (2023) In MM2023 (CCF-A)
  9. CLN-VC: Text-Free Voice Conversion Based on Fine-Grained Style Control and Contrastive Learning with Negative Samples Augmentation (2023) In SpaCCS2023
  10. CP-EB: Talking Face Generation with Controllable Pose and Eye Blinking Embedding (2023) In ISPA2023 (CCF-C)
  11. DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation (2023) In BDCloud2023
  12. PRCA: Fitting Black-Box Large Language Models for Retrieval Question Answering via Pluggable Reward-Driven Contextual Adapter (2023) In EMNLP2023 (CCF-B)
  13. AOSR-Net: All-in-One Sandstorm Removal Network (2023) In ICTAI2023 (CCF-C)
  14. Contrastive Latent Space Reconstruction Learning for Audio-Text Retrieval (2023) In ICTAI2023 (CCF-C)
  15. FastGraphTTS: An Ultrafast Syntax-Aware Speech Synthesis Framework (2023) In ICTAI2023 (CCF-C)
  16. Machine Unlearning Methodology base on Stochastic Teacher Network (2023) In ADMA2023 (CCF-C)
  17. Symbolic and Acoustic: Multi-domain Music Emotion Modeling for Instrumental Music (2023) In ADMA2023 (CCF-C)
  18. Voice Conversion with Denoising Diffusion Probabilistic GAN Models (2023) In ADMA2023 (CCF-C)
  19. Boosting Chinese ASR Error Correction with Dynamic Error Scaling Mechanism (2023) In INTERSPEECH2023 (CCF-C)
  20. EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis (2023) In INTERSPEECH2023 (CCF-C)
  21. Investigation of Music Emotion Recognition Based on Segmented Semi-Supervised Learning (2023) In INTERSPEECH2023 (CCF-C)
  22. Prompt Guided Copy Mechanism for Conversational Question Answering (2023) In INTERSPEECH2023 (CCF-C)
  23. SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model (2023) In IJCNN2023 (CCF-C)
  24. Dynamic Alignment Mask CTC: Improved Mask-CTC with Aligned Cross Entropy (2023) In ICASSP2023 (CCF-B)
  25. Improving EEG-based Emotion Recognition by Fusing Time-frequency And Spatial Representations (2023) In ICASSP2023 (CCF-B)
  26. Improving Music Genre Classification from Multi-modal Properties of Music and Genre Correlations Perspective (2023) In ICASSP2023 (CCF-B)
  27. Learning Speech Representations with Flexible Hidden Feature Dimensions (2023) In ICASSP2023 (CCF-B)
  28. QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis (2023) In ICASSP2023 (CCF-B)
  29. VQ-CL: Learning Disentangled Speech Representations with Contrastive Learning and Vector Quantization (2023) In ICASSP2023 (CCF-B)
  30. Adapitch: Adaption Multi-Speaker Text-to-Speech Conditioned on Pitch Disentangling with Untranscribed Data (2022) In MSN2022 (CCF-C)
  31. Improving Imbalanced Text Classification with Dynamic Curriculum Learning (2022) In MSN2022 (CCF-C)
  32. Improving Speech Representation Learning via Speech-level and Phoneme-level Masking Approach (2022) In MSN2022 (CCF-C)
  33. Linguistic-Enhanced Transformer with CTC Embedding for Speech Recognition (2022) In MSN2022 (CCF-C)
  34. MetaSpeech: Speech Effects Switch Along with Environment for Metaverse (2022) In MSN2022 (CCF-C)
  35. Semi-Supervised Learning Based on Reference Model for Low-resource TTS (2022) In MSN2022 (CCF-C)
  36. Shallow Diffusion Motion Model for Talking Face Generation from Speech (2022) In APWeb-WAIM2022 (CCF-C)
  37. Boosting Star-GANs for Voice Conversion with Contrastive Discriminator (2022) In ICONIP2022 (CCF-C)
  38. Pre-Avatar: An Automatic Presentation Generation Framework Leveraging Talking Avatar (2022) In ICTAI2022 (CCF-C)
  39. Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion (2022) In INTERSPEECH2022 (CCF-C)
  40. Tiny-Sepformer: A Tiny Time-Domain Transformer Network For Speech Separation (2022) In INTERSPEECH2022 (CCF-C)
  41. Uncertainty Calibration for Deep Audio Classifiers (2022) In INTERSPEECH2022 (CCF-C)
  42. Adaptive Activation Network for Low Resource Multilingual Speech Recognition (2022) In IJCNN2022 (CCF-C)
  43. MDCNN-SID: Multi-scale Dilated Convolution Network for Singer Identification (2022) In IJCNN2022 (CCF-C)
  44. MetaSID: Singer Identification with Domain Adaptation for Metaverse (2022) In IJCNN2022 (CCF-C)
  45. Singer Identification for Metaverse with Timbral and Middle-Level Perceptual Features (2022) In IJCNN2022 (CCF-C)
  46. Speech Augmentation Based Unsupervised Learning for Keyword Spotting (2022) In IJCNN2022 (CCF-C)
  47. SUSing: SU-net for Singing Voice Synthesis (2022) In IJCNN2022 (CCF-C)
  48. TDASS: Target Domain Adaptation Speech Synthesis Framework for Multi-speaker Low-Resource TTS (2022) In IJCNN2022 (CCF-C)
  49. AVQVC: One-Shot Voice Conversion By Vector Quantization With Applying Contrastive Learning (2022) In ICASSP2022 (CCF-B)
  50. DRVC: A Framework of Any-to-Any Voice Conversion with Self-Supervised Learning (2022) In ICASSP2022 (CCF-B)
  51. nnSpeech: Speaker-Guided Conditional Variational Autoencoder for Zero-Shot Multi-speaker text-to-speech (2022) In ICASSP2022 (CCF-B)
  52. Self-Attention for Incomplete Utterance Rewriting (2022) In ICASSP2022 (CCF-B)
  53. Blur the Linguistic Boundary: Interpreting Chinese Buddhist Sutra in English via Neural Machine Translation (2022) In ICTAI2022 (CCF-C)
  54. Supervised Contrastive Meta-learning for Few-Shot Classification (2022) In HPCC2022 (CCF-C)
  55. VU-BERT: A Unified Framework for Visual Dialog (2022) In ICASSP2022 (CCF-B)
  56. CycleGEAN: Cycle Generative Enhanced Adversarial Network for Voice Conversion (2021) In ASRU2021
  57. Reconstructing Dual Learning for Neural Voice Conversion Using Relatively Few Samples (2021) In ASRU2021
  58. TGAVC: Improving Autoencoder Voice Conversion with Text-Guided and Adversarial Training (2021) In ASRU2021
  59. Dropout Regularization for Self-Supervised Learning of Transformer Encoder Speech Representation (2021) In INTERSPEECH2021 (CCF-C)
  60. Speech2Video: Cross-Modal Distillation for Speech to Video Generation (2021) In INTERSPEECH2021 (CCF-C)
  61. Variational Information Bottleneck for Effective Low-Resource Audio Classification (2021) In INTERSPEECH2021 (CCF-C)
  62. A Language Model Based Pseudo-Sample Deliberation for Semi-supervised Speech Recognition (2021) In IJCNN2021 (CCF-C)
  63. CACnet: Cube Attentional CNN for Automatic Speech Recognition (2021) In IJCNN2021 (CCF-C)
  64. Loss Prediction: End-to-End Active Learning Approach For Speech Recognition (2021) In IJCNN2021 (CCF-C)
  65. Transfer Ability of Monolingual Wav2vec2.0 for Low-resource Speech Recognition (2021) In IJCNN2021 (CCF-C)
  66. Cross-Language Transfer Learning and Domain Adaptation for End-to-End Automatic Speech Recognition (2021) In ICME2021 (CCF-B)
  67. LVCNet: Efficient Condition-Dependent Modeling Network for Waveform Generation (2021) In ICASSP2021 (CCF-B)
  68. Unidirectional Memory-Self-Attention Transducer for Online Speech Recognition (2021) In ICASSP2021 (CCF-B)
  69. End-To-End Silent Speech Recognition with Acoustic Sensing (2021) In SLT2021
  70. GraphPB: Graphical Representations of Prosody Boundary in Speech Synthesis (2021) In SLT2021
  71. MelGlow: Efficient Waveform Generative Network Based On Location-Variable Convolution (2021) In SLT2021
  72. Multi-Quartznet: Multi-Resolution Convolution for Speech Recognition with Multi-Layer Feature Fusion (2021) In SLT2021
  73. A Novel Capsule Aggregation Framework for Natural Language Inference (2021) In APWeb-WAIM2021 (CCF-C)
  74. Joint Intent Detection and Slot Filling Based on Continual Learning Model (2021) In ICASSP2021 (CCF-B)
  75. Self-supervised Learning for Semantic Sentence Matching with Dense Transformer Inference Network (2021) In APWeb-WAIM2021 (CCF-C)
  76. Semantic Embedding Graph Convolutional Networks for Multi-label Video Segment Classification (2021) In PAAP2021
  77. Semantic Extraction for Sentence Representation via Reinforcement Learning (2021) In IJCNN2021 (CCF-C)
  78. A Real-Time Robot-Based Auxiliary System for Risk Evaluation of COVID-19 Infection (2020) In INTERSPEECH2020 (CCF-C)
  79. Large-Scale Transfer Learning for Low-Resource Spoken Language Understanding (2020) In INTERSPEECH2020 (CCF-C)
  80. MLNET: An Adaptive Multiple Receptive-Field Attention Neural Network for Voice Activity Detection (2020) In INTERSPEECH2020 (CCF-C)
  81. Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit (2020) In INTERSPEECH2020 (CCF-C)
  82. AlignTTS: Efficient Feed-Forward Text-to-Speech System Without Explicit Alignment (2020) In ICASSP2020 (CCF-B)
  83. GraphTTS: Graph-to-Sequence Modelling in Neural Text-to-Speech (2020) In ICASSP2020 (CCF-B)
  84. Chinese Punctuation Prediction with Adaptive Attention and Dependency Tree (2020) In CCKS2020
  85. Epidemic Guard: A COVID-19 Detection System for Elderly People (2020) In APWeb-WAIM2020 (CCF-C)
  86. A Flexible Framework for HMM based Noise Robust Speech Recognition using Generalized Parametric Space Polynomial Regression (2011) In Science China Information Sciences
  87. Generalized Variable Parameter HMMs for Noise Robust Speech Recognition (2011) In INTERSPEECH2011 (CCF-C)
  88. Subspace Noise Estimation and Gamma Distribution-based Microphone Array Post-filter Design (2011) In Chinese Journal of Electronics
  89. Masking Property Based Microphone Array Post-Filter Design (2010) In INTERSPEECH2010 (CCF-C)
  90. Microphone Array Speech Enhancement Based on a Generalized Post-Filter and a Novel Perceptual Filter (2008) In ICOSP2008
  91. An Effective Microphone Array Post-Filter in Arbitrary Environments (2008) In INTERSPEECH2008 (CCF-C)
  92. An Improved A Priori MMSE Spectral Subtraction Method for Speech Enhancement (2007) In IWSDA2007
  93. An Effective Approach for Speech Enhancement by Multi-band MMSE Spectral Subtraction (2007) In NLPKE2007