Xulong Zhang

Executive Director

Xulong Zhang holds a Ph.D. in computer application technology from Fudan University, where he studied under the supervision of Wei Li. His doctoral research focused on music artificial intelligence, specifically singing voice detection and singer identification, two sub-topics of music information retrieval. He currently works as a senior algorithm researcher at PAT, where his main project covers technology and applications related to text-to-speech and AI music.

He has served as an external corporate mentor for the University of Science and Technology of China since 2021, where he has jointly supervised seven graduate students. Since 2023, he has also held the position of external mentor at Tsinghua Shenzhen International Graduate School. He serves as a member of the Federal Data and Federal Intelligence Special Committee, and he was selected for the 2023 Youth Project of the Shanghai Oriental Talent Program. He actively participates in professional organizations and scholarly communities, serving as a reviewer for well-known journals and conferences such as MM, TASLP, ICASSP and EMNLP. He is also a member of CAA (ID:E1412095260M), CCF (ID:N7554M), ACM (ID:5318755) and IEEE (ID:98053721).

Interests
  • Federated Large Models
  • Trusted Computing
  • Graph Computing
Awards

Publications

  1. Turbo-TTS: Enhancing Diffusion Model TTS with an Improved ODE Solver, (2025), †First Author, In ICONIP2025 (CCF-C)
  2. Knowledge Distillation for Financial Large Language Models: A Systematic Review of Strategies, Applications, and Evaluation (2025), In [J], Frontiers of Information Technology & Electronic Engineering (FITEE) (SCI IF=2.9)
  3. Bridging the Modality Gap: Semantic-Calibrated Zero-shot Speech Emotion Captioning, (2025), ✉Corresponding Author, In IJCNN2025 (CCF-C)
  4. Logic Consistency Makes Large Language Models Personalized Reasoning Teachers, (2025), ‡Co-first Author, In IJCNN2025 (CCF-C)
  5. Rano: Restorable Speaker Anonymization via Conditional Invertible Neural Network, (2025), ✉Corresponding Author, In IJCNN2025 (CCF-C)
  6. CycleFlow: Leveraging Cycle Consistency in Flow Matching for Speaker Style Adaptation, (2025), ‡Co-first Author, In ICASSP2025 (CCF-B)
  7. Graph Contrastive Learning with Decoupled Augmentation (2025), In ICASSP2025 (CCF-B)
  8. Homogeneous Graph Extraction: An Approach to Learning Heterogeneous Graph Embedding (2025), In ICASSP2025 (CCF-B)
  9. A Novel Optimization Scheme for Named Entity Recognition with Pre-trained Language Models, (2024), ✉Corresponding Author, In [J], Journal of Electronic Research and Application (JERA) (EI)
  10. IDEAW: Robust Neural Audio Watermarking with Invertible Dual-Embedding, (2024), ‡Co-first Author, In EMNLP2024 (CCF-B)
  11. Enhancing Emotion Prediction and Recognition in Conversation through Fine-Grained Emotional Cue Analysis and Cross-Modal Fusion, (2024), ‡Co-first Author, In ICIC2024 (CCF-C)
  12. RREH: Reconstruction Relations Embedded Hashing for Semi-Paired Cross-Modal Retrieval, (2024), ✉Corresponding Author, In ICIC2024 (CCF-C)
  13. RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis, (2024), ✉Corresponding Author, In APWeb2024 (CCF-C)
  14. CONTUNER: Singing Voice Beautifying with Pitch and Expressiveness Condition, (2024), ✉Corresponding Author, In IJCNN2024 (CCF-C)
  15. EAD-VC: Enhancing Speech Auto-Disentanglement for Voice Conversion with IFUB Estimator and Joint Text-Guided Consistent Learning, (2024), ✉Corresponding Author, In IJCNN2024 (CCF-C)
  16. EfficientASR: Speech Recognition Network Compression via Attention Redundancy and Chunk-Level FFN Optimization, (2024), ✉Corresponding Author, In IJCNN2024 (CCF-C)
  17. Learning Expressive Disentangled Speech Representations with Soft Speech Units and Adversarial Style Augmentation, (2024), ✉Corresponding Author, In IJCNN2024 (CCF-C)
  18. MAIN-VC: Lightweight Speech Representation Disentanglement for One-Shot Voice Conversion, (2024), ✉Corresponding Author, In IJCNN2024 (CCF-C)
  19. QLSC: A Query Latent Semantic Calibrator for Robust Extractive Question Answering, (2024), ✉Corresponding Author, In IJCNN2024 (CCF-C)
  20. Medical Speech Symptoms Classification via Disentangled Representation, (2024), ✉Corresponding Author, In CSCWD2024 (CCF-C)
  21. EmoTalker: Emotionally Editable Talking Face Generation via Diffusion Model, (2024), ‡Co-first Author, In ICASSP2024 (CCF-B)
  22. ED-TTS: Multi-Scale Emotion Modeling Using Cross-Domain Emotion Diarization for Emotional Speech Synthesis, (2024), ‡Co-first Author, In ICASSP2024 (CCF-B)
  23. Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval, (2024), ✉Corresponding Author, In ICASSP2024 (CCF-B)
  24. Research on Audio Model Generation Technology Based on Hierarchical Federated Framework, (2024), ✉Corresponding Author, In CAAI TIT
  25. PMVC: Data Augmentation-Based Prosody Modeling for Expressive Voice Conversion, (2023), ‡Co-first Author, In MM2023 (CCF-A)
  26. CLN-VC: Text-Free Voice Conversion Based on Fine-Grained Style Control and Contrastive Learning with Negative Samples Augmentation, (2023), ‡Co-first Author, In SpaCCS2023
  27. CP-EB: Talking Face Generation with Controllable Pose and Eye Blinking Embedding, (2023), ✉Corresponding Author, In ISPA2023 (CCF-C)
  28. DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation, (2023), ✉Corresponding Author, In BDCloud2023
  29. AOSR-Net: All-in-One Sandstorm Removal Network, (2023), ‡Co-first Author, In ICTAI2023 (CCF-C)
  30. Contrastive Latent Space Reconstruction Learning for Audio-Text Retrieval, (2023), ‡Co-first Author, In ICTAI2023 (CCF-C)
  31. FastGraphTTS: An Ultrafast Syntax-Aware Speech Synthesis Framework, (2023), ✉Corresponding Author, In ICTAI2023 (CCF-C)
  32. DiffTalker: Co-driven audio-image diffusion for talking faces via intermediate landmarks, (2023), ‡Co-first Author, In arXiv (work in progress)
  33. Sparks of Large Audio Models: A Survey and Outlook (2023), In arXiv (work in progress)
  34. A Hierarchy-based Analysis Approach for Blended Learning: A Case Study with Chinese Students (2023), In APWeb2023 (CCF-C)
  35. An Empirical Study of Attention Networks for Semantic Segmentation (2023), In APWeb2023 (CCF-C)
  36. Research on the Impact of Executive Shareholding on New Investment in Enterprises Based on Multivariable Linear Regression Model (2023), In APWeb2023 (CCF-C)
  37. Stock Volatility Prediction Based on Transformer Model Using Mixed-Frequency Data (2023), In APWeb2023 (CCF-C)
  38. Machine Unlearning Methodology base on Stochastic Teacher Network, (2023), †First Author, In ADMA2023 (CCF-C)
  39. Symbolic and Acoustic: Multi-domain Music Emotion Modeling for Instrumental Music, (2023), ‡Co-first Author, In ADMA2023 (CCF-C)
  40. Voice Conversion with Denoising Diffusion Probabilistic GAN Models, (2023), †First Author, In ADMA2023 (CCF-C)
  41. EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis, (2023), ‡Co-first Author, In INTERSPEECH2023 (CCF-C)
  42. Investigation of Music Emotion Recognition Based on Segmented Semi-Supervised Learning, (2023), ‡Co-first Author, In INTERSPEECH2023 (CCF-C)
  43. SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model, (2023), ✉Corresponding Author, In IJCNN2023 (CCF-C)
  44. Dynamic Alignment Mask CTC: Improved Mask-CTC with Aligned Cross Entropy, (2023), †First Author, In ICASSP2023 (CCF-B)
  45. Improving EEG-based Emotion Recognition by Fusing Time-frequency And Spatial Representations, (2023), ‡Co-first Author, In ICASSP2023 (CCF-B)
  46. Improving Music Genre Classification from Multi-modal Properties of Music and Genre Correlations Perspective, (2023), ‡Co-first Author, In ICASSP2023 (CCF-B)
  47. Learning Speech Representations with Flexible Hidden Feature Dimensions (2023), In ICASSP2023 (CCF-B)
  48. QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis, (2023), ‡Co-first Author, In ICASSP2023 (CCF-B)
  49. VQ-CL: Learning Disentangled Speech Representations with Contrastive Learning and Vector Quantization (2023), In ICASSP2023 (CCF-B)
  50. Melody Generation from Lyrics with Local Interpretability (2023), In [J], ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM2023) (CCF-B) (IF=4.094)
  51. Adapitch: Adaption Multi-Speaker Text-to-Speech Conditioned on Pitch Disentangling with Untranscribed Data, (2022), †First Author, In MSN2022 (CCF-C)
  52. Improving Imbalanced Text Classification with Dynamic Curriculum Learning, (2022), †First Author, In MSN2022 (CCF-C)
  53. Improving Speech Representation Learning via Speech-level and Phoneme-level Masking Approach, (2022), †First Author, In MSN2022 (CCF-C)
  54. Linguistic-Enhanced Transformer with CTC Embedding for Speech Recognition, (2022), †First Author, In MSN2022 (CCF-C)
  55. MetaSpeech: Speech Effects Switch Along with Environment for Metaverse, (2022), †First Author, In MSN2022 (CCF-C)
  56. Semi-Supervised Learning Based on Reference Model for Low-resource TTS, (2022), †First Author, In MSN2022 (CCF-C)
  57. Shallow Diffusion Motion Model for Talking Face Generation from Speech, (2022), †First Author, In APWeb-WAIM2022 (CCF-C)
  58. Boosting Star-GANs for Voice Conversion with Contrastive Discriminator (2022), In ICONIP2022 (CCF-C)
  59. Pre-Avatar: An Automatic Presentation Generation Framework Leveraging Talking Avatar, (2022), ‡Co-first Author, In ICTAI2022 (CCF-C)
  60. Tiny-Sepformer: A Tiny Time-Domain Transformer Network For Speech Separation (2022), In INTERSPEECH2022 (CCF-C)
  61. Investigation of Singing Voice Separation for Singing Voice Detection in Polyphonic Music (2022), In CSMT2022 (Best Paper Award)
  62. MDCNN-SID: Multi-scale Dilated Convolution Network for Singer Identification, (2022), †First Author, In IJCNN2022 (CCF-C)
  63. MetaSID: Singer Identification with Domain Adaptation for Metaverse, (2022), †First Author, In IJCNN2022 (CCF-C)
  64. Singer Identification for Metaverse with Timbral and Middle-Level Perceptual Features, (2022), †First Author, In IJCNN2022 (CCF-C)
  65. SUSing: SU-net for Singing Voice Synthesis, (2022), †First Author, In IJCNN2022 (CCF-C)
  66. TDASS: Target Domain Adaptation Speech Synthesis Framework for Multi-speaker Low-Resource TTS, (2022), †First Author, In IJCNN2022 (CCF-C)
  67. AVQVC: One-Shot Voice Conversion By Vector Quantization With Applying Contrastive Learning (2022), In ICASSP2022 (CCF-B)
  68. DRVC: A Framework of Any-to-Any Voice Conversion with Self-Supervised Learning (2022), In ICASSP2022 (CCF-B)
  69. nnSpeech: Speaker-Guided Conditional Variational Autoencoder for Zero-Shot Multi-speaker text-to-speech (2022), In ICASSP2022 (CCF-B)
  70. CycleGEAN: Cycle Generative Enhanced Adversarial Network for Voice Conversion, (2021), †First Author, In ASRU2021
  71. TGAVC: Improving Autoencoder Voice Conversion with Text-Guided and Adversarial Training (2021), In ASRU2021
  72. Singer Identification Using Deep Timbre Feature Learning with KNN-NET, (2021), †First Author, In ICASSP2021 (CCF-B)
  73. Vocal Melody Extraction via HRNet-Based Singing Voice Separation and Encoder-Decoder-Based F0 Estimation (2021), In [J], Electronics2021 (IF=2.69)
  74. Research on Singing Voice Detection Based on a Long-Term Recurrent Convolutional Network with Vocal Separation and Temporal Smoothing, (2020), †First Author, In [J], Electronics2020 (IF=2.69)
  75. Singing Voice Detection Using Multi-Feature Deep Fusion with CNN, (2019), †First Author, In CSMT2019
  76. Transfer Learning for Music Classification and Regression Tasks Using Artist Tags (2019), In CSMT2019
  77. A Novel Singer Identification Method Using GMM-UBM, (2018), †First Author, In CSMT2018
  78. A Practical Singing Voice Detection System Based on GRU-RNN (2018), In CSMT2018 (Best Paper Award)
  79. Music Summary Detection with State Space Embedding and Recurrence Plot (2018), In CSMT2018
  80. Reputation revision method for selecting cloud services based on prior knowledge and a market mechanism (2014), In TSWJ2014 (IF=0.44)
  81. An Autonomic Intrusion Detection Model with Multi-Attribute Auction Mechanism (2013), In IJCSI2013
  82. Probability-Symmetric Storage Allocation for Distributed Storage Systems based on Network Coding (2013), In iJOE2013

Chinese Journal Articles

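  1. An Overview of AI-Generated Content Technology, (2025), †First Author, Big Data Research (CCF-T2)
  2. Robust Audio Authenticity Identification Based on One-Class Learning (2025), Big Data Research, 11 (03), (CCF-T2)
  3. A Robust Audio Watermarking Method Based on Invertible-Network Dual Embedding and an Attack Layer, (2025), †First Author, Big Data Research, 11 (04), (CCF-T2)
  4. Research Progress and Outlook on Embodied Agents Based on Multimodal Large Models (2025), Big Data Research, 11 (03), (CCF-T2)
  5. Research on Embodied-Intelligence Task Planning Based on Large Models: From Single-Agent to Multi-Agent (2025), Big Data Research, 11 (02), (CCF-T2)
  6. An End-to-End Seismic Wave Denoising Method Based on Deep Convolution and Self-Attention (2025), Big Data Research (CCF-T2)
  7. A Survey of Long-Text Inference Optimization Techniques for Large Language Models (2025), Big Data Research (CCF-T2)
  8. A Survey of Visual Enhancement Techniques for Sand-Dust Images (2025), Big Data Research, 11 (01), (CCF-T2)
  9. A Survey of Deepfake Audio Generation and Detection Techniques (2025), Big Data Research, 11 (05), (CCF-T2)
  10. Deep Graph Representation Learning: Methods, Applications, and Challenges, (2025), †First Author, Big Data Research (CCF-T2)
  11. The Generalization Problem in Video Deepfake Detection: Methods, Challenges, and Technical Progress (2025), Big Data Research (CCF-T2)
  12. Research on Audio Model Generation Technology Based on Hierarchical Federated Framework (2024), CAAI Transactions on Intelligent Systems (CCF-T2, PKU Core)
  13. Multi-Feature Fusion Dehazing Based on Generative Adversarial Networks (2024), Big Data Research, 10 (04), (CCF-T2)
  14. A Survey of Emotional Speech Synthesis (2024), Big Data Research, 10 (05), (CCF-T2)
  15. A Survey of Digital Talking Face Generation Techniques (2024), Big Data Research, 10 (05), (CCF-T2)
  16. A Survey of Voice Conversion Techniques for Non-Parallel Corpora (2024), Big Data Research, 10 (03), (CCF-T2)
  17. A Metaverse Air Pollutant Concentration Inference Model Based on Digital Twin Technology (2023), Big Data Research, 9 (01), (CCF-T2)
  18. Design of a Hierarchical Processing Model for the Metaverse Based on Computing Power Networks (2023), Big Data Research, 9 (01), (CCF-T2)
  19. A Survey of Virtual Human Appearance Synthesis Techniques (2023), Big Data Research, 9 (03), (CCF-T2)
  20. A Survey of Expressive Speech Synthesis (2023), Big Data Research, 9 (06), (CCF-T2)
  21. A Practical Singing Voice Detection System Based on U-Net and BGRU-RNN (2019), Microcomputer Applications
  22. Auscultation Using Convolutional Neural Networks on the Basis of Data Augmentation (in English) (2019), Journal of Fudan University (Natural Science) (PKU Core)
  23. A Method for Detecting Short Similar Segments in Music Borrowing (in English) (2019), Journal of Fudan University (Natural Science) (PKU Core)
  24. A Survey of Main Melody Extraction Techniques for Popular Music (2017), Computer Science (CCF-T2, PKU Core)
  25. An Efficient Cloud Storage Data Redundancy Scheme Based on Erasure Codes (2015), Computer Engineering and Design (CCF-T3, PKU Core)
  26. An Autonomous Reputation Management Mechanism for Cloud Services (2013), Journal of Wuhan University (Natural Science Edition) (PKU Core)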

Events