Xulong Zhang

Executive Director

Xulong Zhang received his Ph.D. in computer application technology from Fudan University under the supervision of Wei Li. His doctoral research focused on music artificial intelligence, specifically singing voice detection and singer identification within music information retrieval. He currently works as a senior algorithm researcher at PAT. His main projects involve researching technology and applications related to text-to-speech and AI music.

He has served as an external corporate mentor for the University of Science and Technology of China since 2021, where he has jointly supervised seven graduate students. Since 2023, he has also held the position of external mentor at Tsinghua Shenzhen International Graduate School. He is a member of the Federal Data and Federal Intelligence Special Committee and was selected for the 2023 Youth Project of the Shanghai Oriental Talent Program. He actively participates in professional organizations and scholarly communities, serving as a reviewer for well-known journals and conferences such as MM, TASLP, ICASSP, and EMNLP. He is also a member of CAA (ID: E1412095260M), CCF (ID: N7554M), ACM (ID: 5318755), and IEEE (ID: 98053721).

Interests

Federated Large Models
Trusted Computing
Graph Computing

Awards

1. 2023 Youth Project of the Shanghai Oriental Talent Program / 2023上海市东方英才青年项目

Publications

Bridging the Modality Gap: Semantic-Calibrated Zero-shot Speech Emotion Captioning, (2025), ✉Corresponding Author, In IJCNN2025 (CCF-C)
Logic Consistency Makes Large Language Models Personalized Reasoning Teachers, (2025), ‡Co-first Author, In IJCNN2025 (CCF-C)
Rano: Restorable Speaker Anonymization via Conditional Invertible Neural Network, (2025), ✉Corresponding Author, In IJCNN2025 (CCF-C)
CycleFlow: Leveraging Cycle Consistency in Flow Matching for Speaker Style Adaptation, (2025), ‡Co-first Author, In ICASSP2025 (CCF-B)
Graph Contrastive Learning with Decoupled Augmentation (2025) In ICASSP2025 (CCF-B)
Homogeneous Graph Extraction: An Approach to Learning Heterogeneous Graph Embedding (2025) In ICASSP2025 (CCF-B)
A Novel Optimization Scheme for Named Entity Recognition with Pre-trained Language Models, (2024), ✉Corresponding Author, In JERA (EI)
IDEAW: Robust Neural Audio Watermarking with Invertible Dual-Embedding, (2024), ‡Co-first Author, In EMNLP2024 (CCF-B)
Enhancing Emotion Prediction and Recognition in Conversation through Fine-Grained Emotional Cue Analysis and Cross-Modal Fusion, (2024), ‡Co-first Author, In ICIC2024 (CCF-C)
RREH: Reconstruction Relations Embedded Hashing for Semi-Paired Cross-Modal Retrieval, (2024), ✉Corresponding Author, In ICIC2024 (CCF-C)
RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis, (2024), ✉Corresponding Author, In APWeb2024 (CCF-C)
CONTUNER: Singing Voice Beautifying with Pitch and Expressiveness Condition, (2024), ✉Corresponding Author, In IJCNN2024 (CCF-C)
EAD-VC: Enhancing Speech Auto-Disentanglement for Voice Conversion with IFUB Estimator and Joint Text-Guided Consistent Learning, (2024), ✉Corresponding Author, In IJCNN2024 (CCF-C)
EfficientASR: Speech Recognition Network Compression via Attention Redundancy and Chunk-Level FFN Optimization, (2024), ✉Corresponding Author, In IJCNN2024 (CCF-C)
Learning Expressive Disentangled Speech Representations with Soft Speech Units and Adversarial Style Augmentation, (2024), ✉Corresponding Author, In IJCNN2024 (CCF-C)
MAIN-VC: Lightweight Speech Representation Disentanglement for One-Shot Voice Conversion, (2024), ✉Corresponding Author, In IJCNN2024 (CCF-C)
QLSC: A Query Latent Semantic Calibrator for Robust Extractive Question Answering, (2024), ✉Corresponding Author, In IJCNN2024 (CCF-C)
Medical Speech Symptoms Classification via Disentangled Representation, (2024), ✉Corresponding Author, In CSCWD2024 (CCF-C)
EmoTalker: Emotionally Editable Talking Face Generation via Diffusion Model, (2024), ‡Co-first Author, In ICASSP2024 (CCF-B)
ED-TTS: Multi-Scale Emotion Modeling Using Cross-Domain Emotion Diarization for Emotional Speech Synthesis, (2024), ‡Co-first Author, In ICASSP2024 (CCF-B)
Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval, (2024), ✉Corresponding Author, In ICASSP2024 (CCF-B)
Research on Audio Model Generation Technology Based on Hierarchical Federated Framework, (2024), ✉Corresponding Author, In CAAI TIT
PMVC: Data Augmentation-Based Prosody Modeling for Expressive Voice Conversion, (2023), ‡Co-first Author, In MM2023 (CCF-A)
CLN-VC: Text-Free Voice Conversion Based on Fine-Grained Style Control and Contrastive Learning with Negative Samples Augmentation, (2023), ‡Co-first Author, In SpaCCS2023
CP-EB: Talking Face Generation with Controllable Pose and Eye Blinking Embedding, (2023), ✉Corresponding Author, In ISPA2023 (CCF-C)
DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation, (2023), ✉Corresponding Author, In BDCloud2023
AOSR-Net: All-in-One Sandstorm Removal Network, (2023), ‡Co-first Author, In ICTAI2023 (CCF-C)
Contrastive Latent Space Reconstruction Learning for Audio-Text Retrieval, (2023), ‡Co-first Author, In ICTAI2023 (CCF-C)
FastGraphTTS: An Ultrafast Syntax-Aware Speech Synthesis Framework, (2023), ✉Corresponding Author, In ICTAI2023 (CCF-C)
DiffTalker: Co-driven audio-image diffusion for talking faces via intermediate landmarks, (2023), ‡Co-first Author, In arXiv (work in progress)
Sparks of Large Audio Models: A Survey and Outlook (2023) In arXiv (work in progress)
A Hierarchy-based Analysis Approach for Blended Learning: A Case Study with Chinese Students (2023) In APWeb2023 (CCF-C)
An Empirical Study of Attention Networks for Semantic Segmentation (2023) In APWeb2023 (CCF-C)
Research on the Impact of Executive Shareholding on New Investment in Enterprises Based on Multivariable Linear Regression Model (2023) In APWeb2023 (CCF-C)
Stock Volatility Prediction Based on Transformer Model Using Mixed-Frequency Data (2023) In APWeb2023 (CCF-C)
Machine Unlearning Methodology base on Stochastic Teacher Network, (2023), †First Author, In ADMA2023 (CCF-C)
Symbolic and Acoustic: Multi-domain Music Emotion Modeling for Instrumental Music, (2023), ‡Co-first Author, In ADMA2023 (CCF-C)
Voice Conversion with Denoising Diffusion Probabilistic GAN Models, (2023), †First Author, In ADMA2023 (CCF-C)
EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis, (2023), ‡Co-first Author, In INTERSPEECH2023 (CCF-C)
Investigation of Music Emotion Recognition Based on Segmented Semi-Supervised Learning, (2023), ‡Co-first Author, In INTERSPEECH2023 (CCF-C)
SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model, (2023), ✉Corresponding Author, In IJCNN2023 (CCF-C)
Dynamic Alignment Mask CTC: Improved Mask-CTC with Aligned Cross Entropy, (2023), †First Author, In ICASSP2023 (CCF-B)
Improving EEG-based Emotion Recognition by Fusing Time-frequency And Spatial Representations, (2023), ‡Co-first Author, In ICASSP2023 (CCF-B)
Improving Music Genre Classification from Multi-modal Properties of Music and Genre Correlations Perspective, (2023), ‡Co-first Author, In ICASSP2023 (CCF-B)
Learning Speech Representations with Flexible Hidden Feature Dimensions (2023) In ICASSP2023 (CCF-B)
QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis, (2023), ‡Co-first Author, In ICASSP2023 (CCF-B)
VQ-CL: Learning Disentangled Speech Representations with Contrastive Learning and Vector Quantization (2023) In ICASSP2023 (CCF-B)
Melody Generation from Lyrics with Local Interpretability (2023) In TOMM2023 (CCF-B) (IF=4.094)
Adapitch: Adaption Multi-Speaker Text-to-Speech Conditioned on Pitch Disentangling with Untranscribed Data, (2022), †First Author, In MSN2022 (CCF-C)
Improving Imbalanced Text Classification with Dynamic Curriculum Learning, (2022), †First Author, In MSN2022 (CCF-C)
Improving Speech Representation Learning via Speech-level and Phoneme-level Masking Approach, (2022), †First Author, In MSN2022 (CCF-C)
Linguistic-Enhanced Transformer with CTC Embedding for Speech Recognition, (2022), †First Author, In MSN2022 (CCF-C)
MetaSpeech: Speech Effects Switch Along with Environment for Metaverse, (2022), †First Author, In MSN2022 (CCF-C)
Semi-Supervised Learning Based on Reference Model for Low-resource TTS, (2022), †First Author, In MSN2022 (CCF-C)
Shallow Diffusion Motion Model for Talking Face Generation from Speech, (2022), †First Author, In APWeb-WAIM2022 (CCF-C)
Boosting Star-GANs for Voice Conversion with Contrastive Discriminator (2022) In ICONIP2022 (CCF-C)
Pre-Avatar: An Automatic Presentation Generation Framework Leveraging Talking Avatar, (2022), ‡Co-first Author, In ICTAI2022 (CCF-C)
Tiny-Sepformer: A Tiny Time-Domain Transformer Network For Speech Separation (2022) In INTERSPEECH2022 (CCF-C)
Investigation of Singing Voice Separation for Singing Voice Detection in Polyphonic Music (2022) In CSMT2022 (Best Paper Award)
MDCNN-SID: Multi-scale Dilated Convolution Network for Singer Identification, (2022), †First Author, In IJCNN2022 (CCF-C)
MetaSID: Singer Identification with Domain Adaptation for Metaverse, (2022), †First Author, In IJCNN2022 (CCF-C)
Singer Identification for Metaverse with Timbral and Middle-Level Perceptual Features, (2022), †First Author, In IJCNN2022 (CCF-C)
SUSing: SU-net for Singing Voice Synthesis, (2022), †First Author, In IJCNN2022 (CCF-C)
TDASS: Target Domain Adaptation Speech Synthesis Framework for Multi-speaker Low-Resource TTS, (2022), †First Author, In IJCNN2022 (CCF-C)
AVQVC: One-Shot Voice Conversion By Vector Quantization With Applying Contrastive Learning (2022) In ICASSP2022 (CCF-B)
DRVC: A Framework of Any-to-Any Voice Conversion with Self-Supervised Learning (2022) In ICASSP2022 (CCF-B)
nnSpeech: Speaker-Guided Conditional Variational Autoencoder for Zero-Shot Multi-speaker text-to-speech (2022) In ICASSP2022 (CCF-B)
CycleGEAN: Cycle Generative Enhanced Adversarial Network for Voice Conversion, (2021), †First Author, In ASRU2021
TGAVC: Improving Autoencoder Voice Conversion with Text-Guided and Adversarial Training (2021) In ASRU2021
Singer Identification Using Deep Timbre Feature Learning with KNN-NET, (2021), †First Author, In ICASSP2021 (CCF-B)
Vocal Melody Extraction via HRNet-Based Singing Voice Separation and Encoder-Decoder-Based F0 Estimation (2021) In Electronics2021 (IF=2.69)
Research on Singing Voice Detection Based on a Long-Term Recurrent Convolutional Network with Vocal Separation and Temporal Smoothing, (2020), †First Author, In Electronics2020 (IF=2.69)
Singing Voice Detection Using Multi-Feature Deep Fusion with CNN, (2019), †First Author, In CSMT2019
Transfer Learning for Music Classification and Regression Tasks Using Artist Tags (2019) In CSMT2019
A Novel Singer Identification Method Using GMM-UBM, (2018), †First Author, In CSMT2018
A Practical Singing Voice Detection System Based on GRU-RNN (2018) In CSMT2018 (Best Paper Award)
Music Summary Detection with State Space Embedding and Recurrence Plot (2018) In CSMT2018
Reputation revision method for selecting cloud services based on prior knowledge and a market mechanism (2014) In TSWJ2014 (IF=0.44)
An Autonomic Intrusion Detection Model with Multi-Attribute Auction Mechanism (2013) In IJCSI2013
Probability-Symmetric Storage Allocation for Distributed Storage Systems based on Network Coding (2013) In iJOE2013
