Xulong Zhang

Executive Director

Xulong Zhang received his Ph.D. in computer application technology from Fudan University, under the supervision of Wei Li. His doctoral research focused on music artificial intelligence, specifically singing voice detection and singer identification, both sub-topics of music information retrieval. He currently works as a senior algorithm researcher at PAT, where his main project involves research on technology and applications related to text-to-speech and AI music.

He has served as an external corporate mentor for the University of Science and Technology of China since 2021, where he has jointly supervised seven graduate students. Since 2023, he has also held the position of external mentor at Tsinghua Shenzhen International Graduate School. He serves as a member of the Federal Data and Federal Intelligence Special Committee, and he was selected for the 2023 Youth Project of the Shanghai Oriental Talent Program. He actively participates in professional organizations and scholarly communities, serving as a reviewer for well-known journals and conferences such as MM, TASLP, ICASSP and EMNLP. He is also a member of CAA (ID:E1412095260M), CCF (ID:N7554M), ACM (ID:5318755) and IEEE (ID:98053721).

  • LLM
  • Speech
  • Embodied AI
  • Music AI
  • Medical Speech
  • Speech Security


  1. Enhancing Emotion Prediction and Recognition in Conversation through Fine-Grained Emotional Cue Analysis and Cross-Modal Fusion, (2024), ‡Co-first Author, In ICIC2024 (CCF-C)
  2. RREH: Reconstruction Relations Embedded Hashing for Semi-Paired Cross-Modal Retrieval, (2024), ✉Corresponding Author, In ICIC2024 (CCF-C)
  3. RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis, (2024), ✉Corresponding Author, In APWeb2024 (CCF-C)
  4. CONTUNER: Singing Voice Beautifying with Pitch and Expressiveness Condition, (2024), ✉Corresponding Author, In IJCNN2024 (CCF-C)
  5. EAD-VC: Enhancing Speech Auto-Disentanglement for Voice Conversion with IFUB Estimator and Joint Text-Guided Consistent Learning, (2024), ✉Corresponding Author, In IJCNN2024 (CCF-C)
  6. EfficientASR: Speech Recognition Network Compression via Attention Redundancy and Chunk-Level FFN Optimization, (2024), ✉Corresponding Author, In IJCNN2024 (CCF-C)
  7. Learning Expressive Disentangled Speech Representations with Soft Speech Units and Adversarial Style Augmentation, (2024), ✉Corresponding Author, In IJCNN2024 (CCF-C)
  8. MAIN-VC: Lightweight Speech Representation Disentanglement for One-Shot Voice Conversion, (2024), ✉Corresponding Author, In IJCNN2024 (CCF-C)
  9. QLSC: A Query Latent Semantic Calibrator for Robust Extractive Question Answering, (2024), ✉Corresponding Author, In IJCNN2024 (CCF-C)
  10. Medical Speech Symptoms Classification via Disentangled Representation, (2024), ✉Corresponding Author, In CSCWD2024 (CCF-C)
  11. EmoTalker: Emotionally Editable Talking Face Generation via Diffusion Model, (2024), ‡Co-first Author, In ICASSP2024 (CCF-B)
  12. ED-TTS: Multi-Scale Emotion Modeling Using Cross-Domain Emotion Diarization for Emotional Speech Synthesis, (2024), ‡Co-first Author, In ICASSP2024 (CCF-B)
  13. Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval, (2024), ✉Corresponding Author, In ICASSP2024 (CCF-B)
  14. Research on Audio Model Generation Technology Based on Hierarchical Federated Framework, (2024), ✉Corresponding Author, In CAAI TIT
  15. PMVC: Data Augmentation-Based Prosody Modeling for Expressive Voice Conversion, (2023), ‡Co-first Author, In MM2023 (CCF-A)
  16. CLN-VC: Text-Free Voice Conversion Based on Fine-Grained Style Control and Contrastive Learning with Negative Samples Augmentation, (2023), ‡Co-first Author, In SpaCCS2023
  17. CP-EB: Talking Face Generation with Controllable Pose and Eye Blinking Embedding, (2023), ✉Corresponding Author, In ISPA2023 (CCF-C)
  18. DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation, (2023), ✉Corresponding Author, In BDCloud2023
  19. AOSR-Net: All-in-One Sandstorm Removal Network, (2023), ‡Co-first Author, In ICTAI2023 (CCF-C)
  20. Contrastive Latent Space Reconstruction Learning for Audio-Text Retrieval, (2023), ‡Co-first Author, In ICTAI2023 (CCF-C)
  21. FastGraphTTS: An Ultrafast Syntax-Aware Speech Synthesis Framework, (2023), ✉Corresponding Author, In ICTAI2023 (CCF-C)
  22. Sparks of Large Audio Models: A Survey and Outlook (2023) In arXiv (work in progress)
  23. A Hierarchy-based Analysis Approach for Blended Learning: A Case Study with Chinese Students (2023) In APWeb2023 (CCF-C)
  24. An Empirical Study of Attention Networks for Semantic Segmentation (2023) In APWeb2023 (CCF-C)
  25. Research on the Impact of Executive Shareholding on New Investment in Enterprises Based on Multivariable Linear Regression Model (2023) In APWeb2023 (CCF-C)
  26. Stock Volatility Prediction Based on Transformer Model Using Mixed-Frequency Data (2023) In APWeb2023 (CCF-C)
  27. Machine Unlearning Methodology base on Stochastic Teacher Network, (2023), †First Author, In ADMA2023 (CCF-C)
  28. Symbolic and Acoustic: Multi-domain Music Emotion Modeling for Instrumental Music, (2023), ‡Co-first Author, In ADMA2023 (CCF-C)
  29. Voice Conversion with Denoising Diffusion Probabilistic GAN Models, (2023), †First Author, In ADMA2023 (CCF-C)
  30. EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis, (2023), ‡Co-first Author, In INTERSPEECH2023 (CCF-C)
  31. Investigation of Music Emotion Recognition Based on Segmented Semi-Supervised Learning, (2023), ‡Co-first Author, In INTERSPEECH2023 (CCF-C)
  32. SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model, (2023), ✉Corresponding Author, In IJCNN2023 (CCF-C)
  33. Dynamic Alignment Mask CTC: Improved Mask-CTC with Aligned Cross Entropy, (2023), †First Author, In ICASSP2023 (CCF-B)
  34. Improving EEG-based Emotion Recognition by Fusing Time-frequency And Spatial Representations, (2023), ‡Co-first Author, In ICASSP2023 (CCF-B)
  35. Improving Music Genre Classification from Multi-modal Properties of Music and Genre Correlations Perspective, (2023), ‡Co-first Author, In ICASSP2023 (CCF-B)
  36. Learning Speech Representations with Flexible Hidden Feature Dimensions (2023) In ICASSP2023 (CCF-B)
  37. QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis, (2023), ‡Co-first Author, In ICASSP2023 (CCF-B)
  38. VQ-CL: Learning Disentangled Speech Representations with Contrastive Learning and Vector Quantization (2023) In ICASSP2023 (CCF-B)
  39. Melody Generation from Lyrics with Local Interpretability (2023) In TOMM2023 (CCF-B) (IF=4.094)
  40. Adapitch: Adaption Multi-Speaker Text-to-Speech Conditioned on Pitch Disentangling with Untranscribed Data, (2022), †First Author, In MSN2022 (CCF-C)
  41. Improving Imbalanced Text Classification with Dynamic Curriculum Learning, (2022), †First Author, In MSN2022 (CCF-C)
  42. Improving Speech Representation Learning via Speech-level and Phoneme-level Masking Approach, (2022), †First Author, In MSN2022 (CCF-C)
  43. Linguistic-Enhanced Transformer with CTC Embedding for Speech Recognition, (2022), †First Author, In MSN2022 (CCF-C)
  44. MetaSpeech: Speech Effects Switch Along with Environment for Metaverse, (2022), †First Author, In MSN2022 (CCF-C)
  45. Semi-Supervised Learning Based on Reference Model for Low-resource TTS, (2022), †First Author, In MSN2022 (CCF-C)
  46. Shallow Diffusion Motion Model for Talking Face Generation from Speech, (2022), †First Author, In APWeb-WAIM2022 (CCF-C)
  47. Boosting Star-GANs for Voice Conversion with Contrastive Discriminator (2022) In ICONIP2022 (CCF-C)
  48. Pre-Avatar: An Automatic Presentation Generation Framework Leveraging Talking Avatar, (2022), ‡Co-first Author, In ICTAI2022 (CCF-C)
  49. Tiny-Sepformer: A Tiny Time-Domain Transformer Network For Speech Separation (2022) In INTERSPEECH2022 (CCF-C)
  50. Investigation of Singing Voice Separation for Singing Voice Detection in Polyphonic Music (2022) In CSMT2022 (Best Paper Award)
  51. MDCNN-SID: Multi-scale Dilated Convolution Network for Singer Identification, (2022), †First Author, In IJCNN2022 (CCF-C)
  52. MetaSID: Singer Identification with Domain Adaptation for Metaverse, (2022), †First Author, In IJCNN2022 (CCF-C)
  53. Singer Identification for Metaverse with Timbral and Middle-Level Perceptual Features, (2022), †First Author, In IJCNN2022 (CCF-C)
  54. SUSing: SU-net for Singing Voice Synthesis, (2022), †First Author, In IJCNN2022 (CCF-C)
  55. TDASS: Target Domain Adaptation Speech Synthesis Framework for Multi-speaker Low-Resource TTS, (2022), †First Author, In IJCNN2022 (CCF-C)
  56. AVQVC: One-Shot Voice Conversion By Vector Quantization With Applying Contrastive Learning (2022) In ICASSP2022 (CCF-B)
  57. DRVC: A Framework of Any-to-Any Voice Conversion with Self-Supervised Learning (2022) In ICASSP2022 (CCF-B)
  58. nnSpeech: Speaker-Guided Conditional Variational Autoencoder for Zero-Shot Multi-speaker text-to-speech (2022) In ICASSP2022 (CCF-B)
  59. CycleGEAN: Cycle Generative Enhanced Adversarial Network for Voice Conversion, (2021), †First Author, In ASRU2021
  60. TGAVC: Improving Autoencoder Voice Conversion with Text-Guided and Adversarial Training (2021) In ASRU2021
  61. Singer Identification Using Deep Timbre Feature Learning with KNN-NET, (2021), †First Author, In ICASSP2021 (CCF-B)
  62. Vocal Melody Extraction via HRNet-Based Singing Voice Separation and Encoder-Decoder-Based F0 Estimation (2021) In Electronics2021 (IF=2.69)
  63. Research on Singing Voice Detection Based on a Long-Term Recurrent Convolutional Network with Vocal Separation and Temporal Smoothing, (2020), †First Author, In Electronics2020 (IF=2.69)
  64. Singing Voice Detection Using Multi-Feature Deep Fusion with CNN, (2019), †First Author, In CSMT2019
  65. Transfer Learning for Music Classification and Regression Tasks Using Artist Tags (2019) In CSMT2019
  66. A Novel Singer Identification Method Using GMM-UBM, (2018), †First Author, In CSMT2018
  67. A Practical Singing Voice Detection System Based on GRU-RNN (2018) In CSMT2018 (Best Paper Award)
  68. Music Summary Detection with State Space Embedding and Recurrence Plot (2018) In CSMT2018
  69. Reputation revision method for selecting cloud services based on prior knowledge and a market mechanism (2014) In TSWJ2014 (IF=0.44)
  70. An Autonomic Intrusion Detection Model with Multi-Attribute Auction Mechanism (2013) In IJCSI2013
  71. Probability-Symmetric Storage Allocation for Distributed Storage Systems based on Network Coding (2013) In iJOE2013