The Lab of Large Audio Model (LLAM) is committed to creating innovative solutions that enhance privacy, security, and efficiency in decentralized and complex systems.

[08/11/2025] $\bullet$ We are thrilled to announce that our paper, “Vista: Scene-Aware Optimization for Streaming Video Question Answering under Post-Hoc Queries”, has been accepted to AAAI 2026! Vista addresses the unique challenges of streaming video QA—sequential frames and arbitrary query timing—by introducing (1) scene-aware segmentation that clusters frames into temporally and visually coherent units, (2) scene-aware compression that stores compact scene tokens in GPU memory while offloading full-resolution frames to CPU, and (3) scene-aware recall that selectively reintegrates relevant scenes at query time. Vista is model-agnostic, integrates with diverse vision–language backbones, and enables long-context, low-latency reasoning; experiments on StreamingBench show state-of-the-art results, establishing a strong baseline for real‑world streaming video understanding.
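The three-step pipeline above can be sketched in miniature. This is an illustrative toy only: the function names (`segment_scenes`, `compress`, `recall`), the cosine-similarity threshold, and the use of mean features as scene tokens are assumptions for the sketch, not details from the Vista paper.

```python
import numpy as np

def segment_scenes(frames, sim_threshold=0.8):
    """Greedy scene-aware segmentation: start a new scene when an
    incoming frame's cosine similarity to the current scene's mean
    feature drops below the threshold (a toy stand-in for Vista's
    temporally and visually coherent clustering)."""
    scenes, current = [], [frames[0]]
    for f in frames[1:]:
        mean = np.mean(current, axis=0)
        cos = f @ mean / (np.linalg.norm(f) * np.linalg.norm(mean) + 1e-8)
        if cos >= sim_threshold:
            current.append(f)
        else:
            scenes.append(np.stack(current))
            current = [f]
    scenes.append(np.stack(current))
    return scenes

def compress(scene):
    """Scene-aware compression: keep one compact token per scene (here
    simply the mean feature); full-resolution frames would be offloaded."""
    return scene.mean(axis=0)

def recall(scene_tokens, query, top_k=1):
    """Scene-aware recall: rank scenes by similarity to the query
    embedding and reintegrate the top-k at query time."""
    scores = np.array([token @ query for token in scene_tokens])
    return np.argsort(scores)[::-1][:top_k]
```

In this toy, a stream of frame features is clustered online, each scene is reduced to a single token kept "in memory", and a post-hoc query pulls back only the most relevant scene.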
[21/08/2025] $\bullet$ We are thrilled to announce that our paper, “EMO-RL: Emotion-Rule-Based Reinforcement Learning Enhanced Audio-Language Model for Generalized Speech Emotion Recognition”, has been accepted to EMNLP 2025! In this work, we tackle key challenges in speech emotion recognition (SER) faced by large audio-language models (LALMs), such as emotional ambiguity and limited reasoning in smaller model architectures. By integrating emotion-constrained group-relative policy optimization into pretrained LALMs, EMO-RL significantly enhances emotional reasoning and stability during training.
[26/07/2025] $\bullet$ Our latest paper, “Turbo-TTS: Enhancing Diffusion Model TTS with an Improved ODE Solver,” has been accepted for presentation at the prestigious International Conference on Neural Information Processing (ICONIP 2025)! In this work, we introduce a novel ODE solver that dramatically accelerates diffusion-based TTS models while preserving audio quality.
[26/06/2025] $\bullet$ Our latest paper, “Federated Domain Generalization with Domain-Specific Soft Prompts Generation,” has been accepted for presentation at the prestigious International Conference on Computer Vision (ICCV 2025)! The work represents a return to core research in Federated Learning, now powerfully combined with the cutting-edge field of Large Model Prompt Engineering. We introduce a novel framework that generates domain-specific soft prompts within the federated setting, significantly enhancing model generalization capabilities across unseen domains while preserving data privacy.
[01/06/2025] $\bullet$ We are delighted to share that our paper, “Publicly Verifiable Private Information Retrieval Protocols Based on Function Secret Sharing,” has been accepted to Inscrypt 2025. This achievement coincides with International Children’s Day, a fitting occasion to celebrate the milestone in our research journey. As a core cryptographic study, our work investigates privacy-preserving mechanisms for federated learning, underscoring the indispensable role of security theory in building trustworthy distributed systems. Through years of exploring federated learning’s challenges, we have affirmed that robust cryptographic frameworks are essential for securing data integrity and protecting user privacy in real-world applications.
Research on Federated Large Models focuses on advancing privacy-preserving distributed learning frameworks that enable collaborative training of large-scale AI models across decentralized data sources. This direction integrates cutting-edge techniques in federated learning, differential privacy, and model compression to address challenges in data silos, communication efficiency, and heterogeneous system environments. Key applications include cross-institutional medical analysis, secure financial risk prediction, and edge-device personalized AI services while ensuring strict compliance with data governance regulations.
Research on Trusted Computing aims to build secure and verifiable computing systems through hardware-rooted security mechanisms, enclave-based confidential computing, and decentralized trust verification protocols. We focus on designing architectures that guarantee data integrity, execution traceability, and resistance to adversarial attacks across cloud-edge environments. Our innovations are applied to blockchain consensus optimization, privacy-preserving biometric authentication, and AI model provenance tracking, establishing trust foundations for next-generation mission-critical systems.
Research on Graph Computing explores efficient algorithms and systems for analyzing complex relational data at web scale. We develop novel graph neural network architectures, dynamic subgraph mining techniques, and heterogeneous graph embedding methods to address challenges in billion-edge network processing, real-time knowledge graph reasoning, and multimodal graph representation learning. Applications span social network fraud detection, drug discovery through molecular interaction networks, and smart city traffic optimization systems.
Research on Large Audio Models aims to advance audio processing, generation, understanding, and multimodal modeling. This research encompasses a wide range of applications, including speech recognition, virtual assistants, music composition, audio synthesis, and more. Within this broad scope, several key areas of focus include: Low-Resource TTS, Expressive TTS, Voice Conversion, Audio Captioning, Speech Security, and Music AI.

This paper introduces Turbo-TTS, a novel diffusion-based model for text-to-speech (TTS) synthesis. Diffusion models leverage stochastic differential equations (SDEs) to generate high-fidelity speech. To enhance the sampling efficiency of the diffusion process, we propose a new ordinary differential equation (ODE) solver and integrate consistency modeling principles into the TTS framework, leading to significant improvements in synthesized speech quality. Our approach discretizes the underlying SDE describing diffusion into a probability flow ODE (PF ODE). This PF ODE shares the same marginal distribution as the original SDE but offers improved tractability for numerical solution. Experimental evaluations demonstrate that Turbo-TTS produces high-quality speech with substantially reduced computational requirements. The model achieves low-latency synthesis through single-step sampling (NFE = 1, RTF = 0.0074), indicating strong suitability for real-time applications.
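The PF-ODE sampling idea described above can be illustrated with a generic few-step Euler solver. This sketch assumes a variance-exploding noise schedule (sigma(t) = t) and a toy closed-form score; it is not the paper's improved solver, and `pf_ode_sample` and the point-mass data distribution are assumptions for illustration only.

```python
import numpy as np

def pf_ode_sample(x_T, score_fn, T=1.0, nfe=1):
    """Euler solver for the variance-exploding probability-flow ODE
        dx/dt = -0.5 * g(t)^2 * score(x, t),
    with sigma(t) = t, hence g(t)^2 = 2t, integrated from t = T down
    to t = 0 in `nfe` steps (number of score-function evaluations)."""
    ts = np.linspace(T, 0.0, nfe + 1)
    x = np.asarray(x_T, dtype=float)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        dx_dt = -t_cur * score_fn(x, t_cur)  # -0.5 * (2t) * score
        x = x + (t_next - t_cur) * dx_dt
    return x

# Toy check: if the data distribution is a point mass at mu, the
# marginal at time t is N(mu, t^2 I), so the score has a closed form.
# For this linear ODE a single Euler step (NFE = 1) is already exact.
mu = np.array([0.3, -0.7])
score = lambda x, t: (mu - x) / t**2
x_T = mu + np.random.default_rng(0).standard_normal(2)  # noisy sample at t = T
x_0 = pf_ode_sample(x_T, score, T=1.0, nfe=1)
```

In a real TTS model the analytic score would be replaced by a learned score network over mel-spectrogram frames; the solver loop itself is unchanged.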

Although Large Audio-Language Models (LALMs) have exhibited outstanding performance in auditory understanding, their performance in affective computing scenarios, particularly in emotion recognition, reasoning, and subtle sentiment differentiation, remains suboptimal. Recent advances in Reinforcement Learning (RL) have shown promise in improving LALMs’ reasoning abilities. However, two critical challenges hinder the direct application of RL techniques to Speech Emotion Recognition (SER) tasks: (1) convergence instability caused by ambiguous emotional boundaries and (2) limited reasoning ability when using relatively small models (e.g., 7B-parameter architectures). To overcome these limitations, we introduce EMO-RL, a novel framework incorporating reinforcement learning with two key innovations: Emotion Similarity-Weighted Reward (ESWR) and Explicit Structured Reasoning (ESR). Built upon pretrained LALMs, our method employs group-relative policy optimization with emotion constraints. Comprehensive experiments demonstrate that our EMO-RL training strategies significantly enhance the emotional reasoning capabilities of LALMs, attaining state-of-the-art results on both the MELD and IEMOCAP datasets, while cross-dataset experiments demonstrate strong generalization.
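The similarity-weighted reward idea can be sketched as follows. The label set and similarity values below are invented for illustration and are not taken from the paper; the point is only that a near-miss prediction (e.g., "frustrated" for "angry") earns partial credit instead of a hard 0, which smooths the reward landscape around ambiguous emotional boundaries.

```python
# Toy ESWR-style reward: partial credit for emotionally adjacent labels.
EMOTIONS = ["angry", "frustrated", "sad", "neutral", "happy"]

# Symmetric, hypothetical similarity scores between distinct emotions.
SIM = {
    ("angry", "frustrated"): 0.7,
    ("sad", "frustrated"): 0.4,
    ("neutral", "sad"): 0.2,
    ("neutral", "happy"): 0.3,
}

def emotion_similarity(a, b):
    """Look up similarity in either order; unrelated pairs score 0."""
    if a == b:
        return 1.0
    return SIM.get((a, b), SIM.get((b, a), 0.0))

def eswr_reward(pred, gold):
    """1.0 for an exact match, partial credit for a similar emotion,
    0.0 for a distant one -- a smoother signal than 0/1 accuracy for
    policy optimization."""
    return emotion_similarity(pred, gold)
```

In an RL loop, this reward would replace the binary correctness signal when scoring sampled model outputs before the group-relative policy update.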

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.
The International Conference on Neural Information Processing (ICONIP) is an annual conference of the Asia Pacific Neural Network Society (APNNS). ICONIP brings together attendees from around the world across diverse disciplines and professions, including researchers, academics, and industry experts, all working collaboratively to tackle real-world challenges and to contribute to society.

The 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025) is set to be a major event for researchers, practitioners, and enthusiasts in the field of natural language processing (NLP). Taking place from November 5th to 9th in Suzhou, China, this conference promises to showcase cutting-edge research, innovative applications, and thought-provoking discussions.

ICCV is hosted by the Institute of Electrical and Electronics Engineers (IEEE). It is a premier international computer vision event, and its proceedings reflect the latest trends and the highest level of research in the field. Highly regarded in the community, it has the lowest acceptance rate among the three major computer vision conferences.

The 21st International Conference on Information Security and Cryptology (Inscrypt 2025) will be held in Xi’an from October 19th to October 21st, 2025, organized by the State Key Laboratory of Integrated Services Networks (ISN) of Xidian University and the State Key Laboratory of Cyberspace Security Defense (SKLCSD) of the Institute of Information Engineering, Chinese Academy of Sciences. Inscrypt 2025 seeks high-quality research contributions in the form of well-developed papers. Topics of interest encompass research advances in all areas of information security, cryptology, and their applications. The conference proceedings will be published by Springer in the LNCS series.