Ning Cheng


Ning Cheng is a researcher in speech technology and natural language processing. With over a decade of experience, he has contributed to speech recognition, speech synthesis, and natural language processing. He earned his Bachelor’s degree in Mathematics and Applied Mathematics from Beijing University of Science and Technology in 2003, followed by a Master’s degree in Systems Engineering from the same institution in April 2006. He completed his Ph.D. in Pattern Recognition and Intelligent Systems at the Graduate University of the Chinese Academy of Sciences in July 2009. He began his career in July 2009 as an Assistant Researcher at the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences. In September 2011, he joined the Search Technology Center of Microsoft (China) Co., Ltd. as an Associate Researcher. In 2015, he moved to the Institute of Software, Chinese Academy of Sciences, as a Senior Engineer. He joined PAT (Shenzhen) Co., Ltd. in September 2016 as an Associate Researcher.

He has authored more than 80 academic papers presented at top international conferences in speech technology. His work has also appeared in Chinese journals such as “China Science,” “Journal of Electronics,” and “Acta Automatica Sinica.” He has contributed to research projects funded by the National 973 Program, the National 863 Program, and the National Natural Science Foundation of China, and has served as principal investigator for a research project supported by the Guangdong Provincial Natural Science Foundation.

  • TTS
  • ASR
  • NLP
  • Voice Conversion
  • Artificial Intelligence


  1. Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning (2024) In ACL2024 (CCF-A)
  2. Enhancing Emotion Prediction and Recognition in Conversation through Fine-Grained Emotional Cue Analysis and Cross-Modal Fusion (2024) In ICIC2024 (CCF-C)
  3. RREH: Reconstruction Relations Embedded Hashing for Semi-Paired Cross-Modal Retrieval (2024) In ICIC2024 (CCF-C)
  4. RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis (2024) In APWeb2024 (CCF-C)
  5. CONTUNER: Singing Voice Beautifying with Pitch and Expressiveness Condition (2024) In IJCNN2024 (CCF-C)
  6. EAD-VC: Enhancing Speech Auto-Disentanglement for Voice Conversion with IFUB Estimator and Joint Text-Guided Consistent Learning (2024) In IJCNN2024 (CCF-C)
  7. EfficientASR: Speech Recognition Network Compression via Attention Redundancy and Chunk-Level FFN Optimization (2024) In IJCNN2024 (CCF-C)
  8. Learning Expressive Disentangled Speech Representations with Soft Speech Units and Adversarial Style Augmentation (2024) In IJCNN2024 (CCF-C)
  9. MAIN-VC: Lightweight Speech Representation Disentanglement for One-Shot Voice Conversion (2024) In IJCNN2024 (CCF-C)
  10. QLSC: A Query Latent Semantic Calibrator for Robust Extractive Question Answering (2024) In IJCNN2024 (CCF-C)
  11. From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning (2024) In NAACL2024 (CCF-B)
  12. Medical Speech Symptoms Classification via Disentangled Representation (2024) In CSCWD2024 (CCF-C)
  13. EmoTalker: Emotionally Editable Talking Face Generation via Diffusion Model (2024) ✉Corresponding Author, In ICASSP2024 (CCF-B)
  14. ED-TTS: Multi-Scale Emotion Modeling Using Cross-Domain Emotion Diarization for Emotional Speech Synthesis (2024) ✉Corresponding Author, In ICASSP2024 (CCF-B)
  15. Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval (2024) ✉Corresponding Author, In ICASSP2024 (CCF-B)
  16. Leveraging Biases in Large Language Models: bias-kNN for Effective Few-Shot Learning (2024) ✉Corresponding Author, In ICASSP2024 (CCF-B)
  17. Research on Audio Model Generation Technology Based on Hierarchical Federated Framework (2024) In CAAI TIT
  18. On the Calibration and Uncertainty with Pólya-Gamma Augmentation for Dialog Retrieval Models (2023) In AAAI2023 (CCF-A)
  19. PMVC: Data Augmentation-Based Prosody Modeling for Expressive Voice Conversion (2023) In MM2023 (CCF-A)
  20. CLN-VC: Text-Free Voice Conversion Based on Fine-Grained Style Control and Contrastive Learning with Negative Samples Augmentation (2023) In SpaCCS2023
  21. CP-EB: Talking Face Generation with Controllable Pose and Eye Blinking Embedding (2023) In ISPA2023 (CCF-C)
  22. DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation (2023) In BDCloud2023
  23. PRCA: Fitting Black-Box Large Language Models for Retrieval Question Answering via Pluggable Reward-Driven Contextual Adapter (2023) In EMNLP2023 (CCF-B)
  24. AOSR-Net: All-in-One Sandstorm Removal Network (2023) In ICTAI2023 (CCF-C)
  25. Contrastive Latent Space Reconstruction Learning for Audio-Text Retrieval (2023) In ICTAI2023 (CCF-C)
  26. FastGraphTTS: An Ultrafast Syntax-Aware Speech Synthesis Framework (2023) In ICTAI2023 (CCF-C)
  27. Machine Unlearning Methodology Based on Stochastic Teacher Network (2023) In ADMA2023 (CCF-C)
  28. Symbolic and Acoustic: Multi-domain Music Emotion Modeling for Instrumental Music (2023) In ADMA2023 (CCF-C)
  29. Voice Conversion with Denoising Diffusion Probabilistic GAN Models (2023) In ADMA2023 (CCF-C)
  30. Boosting Chinese ASR Error Correction with Dynamic Error Scaling Mechanism (2023) In INTERSPEECH2023 (CCF-C)
  31. EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis (2023) In INTERSPEECH2023 (CCF-C)
  32. Investigation of Music Emotion Recognition Based on Segmented Semi-Supervised Learning (2023) In INTERSPEECH2023 (CCF-C)
  33. Prompt Guided Copy Mechanism for Conversational Question Answering (2023) In INTERSPEECH2023 (CCF-C)
  34. SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model (2023) In IJCNN2023 (CCF-C)
  35. Dynamic Alignment Mask CTC: Improved Mask-CTC with Aligned Cross Entropy (2023) In ICASSP2023 (CCF-B)
  36. Improving EEG-based Emotion Recognition by Fusing Time-frequency And Spatial Representations (2023) In ICASSP2023 (CCF-B)
  37. Improving Music Genre Classification from Multi-modal Properties of Music and Genre Correlations Perspective (2023) In ICASSP2023 (CCF-B)
  38. Learning Speech Representations with Flexible Hidden Feature Dimensions (2023) In ICASSP2023 (CCF-B)
  39. QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis (2023) In ICASSP2023 (CCF-B)
  40. VQ-CL: Learning Disentangled Speech Representations with Contrastive Learning and Vector Quantization (2023) In ICASSP2023 (CCF-B)
  41. Adapitch: Adaption Multi-Speaker Text-to-Speech Conditioned on Pitch Disentangling with Untranscribed Data (2022) In MSN2022 (CCF-C)
  42. Improving Imbalanced Text Classification with Dynamic Curriculum Learning (2022) In MSN2022 (CCF-C)
  43. Improving Speech Representation Learning via Speech-level and Phoneme-level Masking Approach (2022) In MSN2022 (CCF-C)
  44. Linguistic-Enhanced Transformer with CTC Embedding for Speech Recognition (2022) In MSN2022 (CCF-C)
  45. MetaSpeech: Speech Effects Switch Along with Environment for Metaverse (2022) In MSN2022 (CCF-C)
  46. Semi-Supervised Learning Based on Reference Model for Low-resource TTS (2022) In MSN2022 (CCF-C)
  47. Shallow Diffusion Motion Model for Talking Face Generation from Speech (2022) In APWeb-WAIM2022 (CCF-C)
  48. Boosting Star-GANs for Voice Conversion with Contrastive Discriminator (2022) In ICONIP2022 (CCF-C)
  49. Pre-Avatar: An Automatic Presentation Generation Framework Leveraging Talking Avatar (2022) In ICTAI2022 (CCF-C)
  50. Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion (2022) In INTERSPEECH2022 (CCF-C)
  51. Tiny-Sepformer: A Tiny Time-Domain Transformer Network For Speech Separation (2022) In INTERSPEECH2022 (CCF-C)
  52. Uncertainty Calibration for Deep Audio Classifiers (2022) In INTERSPEECH2022 (CCF-C)
  53. Adaptive Activation Network for Low Resource Multilingual Speech Recognition (2022) In IJCNN2022 (CCF-C)
  54. MDCNN-SID: Multi-scale Dilated Convolution Network for Singer Identification (2022) In IJCNN2022 (CCF-C)
  55. MetaSID: Singer Identification with Domain Adaptation for Metaverse (2022) In IJCNN2022 (CCF-C)
  56. Singer Identification for Metaverse with Timbral and Middle-Level Perceptual Features (2022) In IJCNN2022 (CCF-C)
  57. Speech Augmentation Based Unsupervised Learning for Keyword Spotting (2022) In IJCNN2022 (CCF-C)
  58. SUSing: SU-net for Singing Voice Synthesis (2022) In IJCNN2022 (CCF-C)
  59. TDASS: Target Domain Adaptation Speech Synthesis Framework for Multi-speaker Low-Resource TTS (2022) In IJCNN2022 (CCF-C)
  60. AVQVC: One-Shot Voice Conversion By Vector Quantization With Applying Contrastive Learning (2022) In ICASSP2022 (CCF-B)
  61. DRVC: A Framework of Any-to-Any Voice Conversion with Self-Supervised Learning (2022) In ICASSP2022 (CCF-B)
  62. nnSpeech: Speaker-Guided Conditional Variational Autoencoder for Zero-Shot Multi-speaker Text-to-Speech (2022) In ICASSP2022 (CCF-B)
  63. Self-Attention for Incomplete Utterance Rewriting (2022) In ICASSP2022 (CCF-B)
  64. Blur the Linguistic Boundary: Interpreting Chinese Buddhist Sutra in English via Neural Machine Translation (2022) In ICTAI2022 (CCF-C)
  65. Supervised Contrastive Meta-learning for Few-Shot Classification (2022) In HPCC2022 (CCF-C)
  66. VU-BERT: A Unified Framework for Visual Dialog (2022) In ICASSP2022 (CCF-B)
  67. CycleGEAN: Cycle Generative Enhanced Adversarial Network for Voice Conversion (2021) In ASRU2021
  68. Reconstructing Dual Learning for Neural Voice Conversion Using Relatively Few Samples (2021) In ASRU2021
  69. TGAVC: Improving Autoencoder Voice Conversion with Text-Guided and Adversarial Training (2021) In ASRU2021
  70. Dropout Regularization for Self-Supervised Learning of Transformer Encoder Speech Representation (2021) In INTERSPEECH2021 (CCF-C)
  71. Speech2Video: Cross-Modal Distillation for Speech to Video Generation (2021) In INTERSPEECH2021 (CCF-C)
  72. Variational Information Bottleneck for Effective Low-Resource Audio Classification (2021) In INTERSPEECH2021 (CCF-C)
  73. A Language Model Based Pseudo-Sample Deliberation for Semi-supervised Speech Recognition (2021) In IJCNN2021 (CCF-C)
  74. CACnet: Cube Attentional CNN for Automatic Speech Recognition (2021) In IJCNN2021 (CCF-C)
  75. Loss Prediction: End-to-End Active Learning Approach For Speech Recognition (2021) In IJCNN2021 (CCF-C)
  76. Transfer Ability of Monolingual Wav2vec2.0 for Low-resource Speech Recognition (2021) In IJCNN2021 (CCF-C)
  77. Cross-Language Transfer Learning and Domain Adaptation for End-to-End Automatic Speech Recognition (2021) In ICME2021 (CCF-B)
  78. LVCNet: Efficient Condition-Dependent Modeling Network for Waveform Generation (2021) In ICASSP2021 (CCF-B)
  79. Unidirectional Memory-Self-Attention Transducer for Online Speech Recognition (2021) In ICASSP2021 (CCF-B)
  80. End-To-End Silent Speech Recognition with Acoustic Sensing (2021) In SLT2021
  81. GraphPB: Graphical Representations of Prosody Boundary in Speech Synthesis (2021) In SLT2021
  82. MelGlow: Efficient Waveform Generative Network Based On Location-Variable Convolution (2021) In SLT2021
  83. Multi-Quartznet: Multi-Resolution Convolution for Speech Recognition with Multi-Layer Feature Fusion (2021) In SLT2021
  84. A Novel Capsule Aggregation Framework for Natural Language Inference (2021) In APWeb-WAIM2021 (CCF-C)
  85. Joint Intent Detection and Slot Filling Based on Continual Learning Model (2021) In ICASSP2021 (CCF-B)
  86. Self-supervised Learning for Semantic Sentence Matching with Dense Transformer Inference Network (2021) In APWeb-WAIM2021 (CCF-C)
  87. Semantic Embedding Graph Convolutional Networks for Multi-label Video Segment Classification (2021) In PAAP2021
  88. Semantic Extraction for Sentence Representation via Reinforcement Learning (2021) In IJCNN2021 (CCF-C)
  89. A Real-Time Robot-Based Auxiliary System for Risk Evaluation of COVID-19 Infection (2020) In INTERSPEECH2020 (CCF-C)
  90. Large-Scale Transfer Learning for Low-Resource Spoken Language Understanding (2020) In INTERSPEECH2020 (CCF-C)
  91. MLNET: An Adaptive Multiple Receptive-Field Attention Neural Network for Voice Activity Detection (2020) In INTERSPEECH2020 (CCF-C)
  92. Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit (2020) In INTERSPEECH2020 (CCF-C)
  93. AlignTTS: Efficient Feed-Forward Text-to-Speech System Without Explicit Alignment (2020) In ICASSP2020 (CCF-B)
  94. GraphTTS: Graph-to-Sequence Modelling in Neural Text-to-Speech (2020) In ICASSP2020 (CCF-B)
  95. Chinese Punctuation Prediction with Adaptive Attention and Dependency Tree (2020) In CCKS2020
  96. Epidemic Guard: A COVID-19 Detection System for Elderly People (2020) In APWeb-WAIM2020 (CCF-C)