The Lab of Large Audio Model (LLAM) is committed to creating innovative solutions that enhance privacy, security, and efficiency in decentralized and complex systems.

[08/11/2025] $\bullet$ We are thrilled to announce that our paper, “Vista: Scene-Aware Optimization for Streaming Video Question Answering under Post-Hoc Queries”, has been accepted to AAAI 2026! Vista addresses the unique challenges of streaming video QA—sequential frames and arbitrary query timing—by introducing (1) scene-aware segmentation that clusters frames into temporally and visually coherent units, (2) scene-aware compression that stores compact scene tokens in GPU memory while offloading full-resolution frames to CPU, and (3) scene-aware recall that selectively reintegrates relevant scenes at query time. Vista is model-agnostic, integrates with diverse vision–language backbones, and enables long-context, low-latency reasoning; experiments on StreamingBench show state-of-the-art results, establishing a strong baseline for real‑world streaming video understanding.
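The three-step pipeline above can be sketched in miniature. This is an illustrative toy only: the function names (`segment_scenes`, `compress`, `recall`), the cosine-similarity threshold, and the use of mean features as scene tokens are assumptions for the sketch, not details from the Vista paper.

```python
import numpy as np

def segment_scenes(frames, sim_threshold=0.8):
    """Greedy scene-aware segmentation: start a new scene when an
    incoming frame's cosine similarity to the current scene's mean
    feature drops below the threshold (a toy stand-in for Vista's
    temporally and visually coherent clustering)."""
    scenes, current = [], [frames[0]]
    for f in frames[1:]:
        mean = np.mean(current, axis=0)
        cos = f @ mean / (np.linalg.norm(f) * np.linalg.norm(mean) + 1e-8)
        if cos >= sim_threshold:
            current.append(f)
        else:
            scenes.append(np.stack(current))
            current = [f]
    scenes.append(np.stack(current))
    return scenes

def compress(scene):
    """Scene-aware compression: keep one compact token per scene (here
    simply the mean feature); full-resolution frames would be offloaded."""
    return scene.mean(axis=0)

def recall(scene_tokens, query, top_k=1):
    """Scene-aware recall: rank scenes by similarity to the query
    embedding and reintegrate the top-k at query time."""
    scores = np.array([token @ query for token in scene_tokens])
    return np.argsort(scores)[::-1][:top_k]
```

In this toy, a stream of frame features is clustered online, each scene is reduced to a single token kept "in memory", and a post-hoc query pulls back only the most relevant scene.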
[21/08/2025] $\bullet$ We are thrilled to announce that our paper, “EMO-RL: Emotion-Rule-Based Reinforcement Learning Enhanced Audio-Language Model for Generalized Speech Emotion Recognition”, has been accepted to EMNLP 2025! In this work, we tackle key challenges in speech emotion recognition (SER) faced by large audio-language models (LALMs), such as emotional ambiguity and limited reasoning in smaller model architectures. By integrating emotion-constrained group-relative policy optimization into pretrained LALMs, EMO-RL significantly enhances emotional reasoning and stability during training.
[26/07/2025] $\bullet$ Our latest paper, “Turbo-TTS: Enhancing Diffusion Model TTS with an Improved ODE Solver,” has been accepted for presentation at the prestigious International Conference on Neural Information Processing (ICONIP 2025)! In this work, we introduce a novel ODE solver that dramatically accelerates diffusion-based TTS models while preserving audio quality.
[26/06/2025] $\bullet$ Our latest paper, “Federated Domain Generalization with Domain-Specific Soft Prompts Generation,” has been accepted for presentation at the prestigious International Conference on Computer Vision (ICCV 2025)! The work represents a return to core research in Federated Learning, now powerfully combined with the cutting-edge field of Large Model Prompt Engineering. We introduce a novel framework that generates domain-specific soft prompts within the federated setting, significantly enhancing model generalization capabilities across unseen domains while preserving data privacy.
[01/06/2025] $\bullet$ We are delighted to share that our paper, “Publicly Verifiable Private Information Retrieval Protocols Based on Function Secret Sharing,” has been accepted to Inscrypt 2025. This achievement coincides with International Children’s Day, a fitting occasion to celebrate the milestone in our research journey. As a core cryptographic study, our work investigates privacy-preserving mechanisms for federated learning, underscoring the indispensable role of security theory in building trustworthy distributed systems. Through years of exploring federated learning’s challenges, we have affirmed that robust cryptographic frameworks are essential for securing data integrity and protecting user privacy in real-world applications.
Research on Federated Large Models focuses on advancing privacy-preserving distributed learning frameworks that enable collaborative training of large-scale AI models across decentralized data sources. This direction integrates cutting-edge techniques in federated learning, differential privacy, and model compression to address challenges in data silos, communication efficiency, and heterogeneous system environments. Key applications include cross-institutional medical analysis, secure financial risk prediction, and edge-device personalized AI services while ensuring strict compliance with data governance regulations.
Research on Trusted Computing aims to build secure and verifiable computing systems through hardware-rooted security mechanisms, enclave-based confidential computing, and decentralized trust verification protocols. We focus on designing architectures that guarantee data integrity, execution traceability, and resistance to adversarial attacks across cloud-edge environments. Our innovations are applied to blockchain consensus optimization, privacy-preserving biometric authentication, and AI model provenance tracking, establishing trust foundations for next-generation mission-critical systems.
Research on Graph Computing explores efficient algorithms and systems for analyzing complex relational data at web scale. We develop novel graph neural network architectures, dynamic subgraph mining techniques, and heterogeneous graph embedding methods to address challenges in billion-edge network processing, real-time knowledge graph reasoning, and multimodal graph representation learning. Applications span social network fraud detection, drug discovery through molecular interaction networks, and smart city traffic optimization systems.
Research on Large Audio Models aims to advance audio processing, generation, understanding, and multimodal modeling. This research encompasses a wide range of applications, including speech recognition, virtual assistants, music composition, audio synthesis, and more. Within this broad scope, several key areas of focus include: Low-Resource TTS, Expressive TTS, Voice Conversion, Audio Captioning, Speech Security, and Music AI.

This paper introduces Turbo-TTS, a novel diffusion-based model for text-to-speech (TTS) synthesis. Diffusion models leverage stochastic differential equations (SDEs) to generate high-fidelity speech. To enhance the sampling efficiency of the diffusion process, we propose a new ordinary differential equation (ODE) solver and integrate consistency modeling principles into the TTS framework, leading to significant improvements in synthesized speech quality. Our approach discretizes the underlying SDE describing diffusion into a probability flow ODE (PF ODE). This PF ODE shares the same marginal distribution as the original SDE but offers improved tractability for numerical solution. Experimental evaluations demonstrate that Turbo-TTS produces high-quality speech with substantially reduced computational requirements. The model achieves low-latency synthesis through single-step sampling (NFE = 1, RTF = 0.0074), indicating strong suitability for real-time applications.
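The PF-ODE sampling idea described above can be illustrated with a generic few-step Euler solver. This sketch assumes a variance-exploding noise schedule (sigma(t) = t) and a toy closed-form score; it is not the paper's improved solver, and `pf_ode_sample` and the point-mass data distribution are assumptions for illustration only.

```python
import numpy as np

def pf_ode_sample(x_T, score_fn, T=1.0, nfe=1):
    """Euler solver for the variance-exploding probability-flow ODE
        dx/dt = -0.5 * g(t)^2 * score(x, t),
    with sigma(t) = t, hence g(t)^2 = 2t, integrated from t = T down
    to t = 0 in `nfe` steps (number of score-function evaluations)."""
    ts = np.linspace(T, 0.0, nfe + 1)
    x = np.asarray(x_T, dtype=float)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        dx_dt = -t_cur * score_fn(x, t_cur)  # -0.5 * (2t) * score
        x = x + (t_next - t_cur) * dx_dt
    return x

# Toy check: if the data distribution is a point mass at mu, the
# marginal at time t is N(mu, t^2 I), so the score has a closed form.
# For this linear ODE a single Euler step (NFE = 1) is already exact.
mu = np.array([0.3, -0.7])
score = lambda x, t: (mu - x) / t**2
x_T = mu + np.random.default_rng(0).standard_normal(2)  # noisy sample at t = T
x_0 = pf_ode_sample(x_T, score, T=1.0, nfe=1)
```

In a real TTS model the analytic score would be replaced by a learned score network over mel-spectrogram frames; the solver loop itself is unchanged.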

Although Large Audio-Language Models (LALMs) have exhibited outstanding performance in auditory understanding, their performance in affective computing scenarios, particularly in emotion recognition, reasoning, and subtle sentiment differentiation, remains suboptimal. Recent advances in Reinforcement Learning (RL) have shown promise in improving LALMs’ reasoning abilities. However, two critical challenges hinder the direct application of RL techniques to Speech Emotion Recognition (SER) tasks: (1) convergence instability caused by ambiguous emotional boundaries and (2) limited reasoning ability when using relatively small models (e.g., 7B-parameter architectures). To overcome these limitations, we introduce EMO-RL, a novel framework incorporating reinforcement learning with two key innovations: Emotion Similarity-Weighted Reward (ESWR) and Explicit Structured Reasoning (ESR). Built upon pretrained LALMs, our method employs group-relative policy optimization with emotion constraints. Comprehensive experiments demonstrate that our EMO-RL training strategies significantly enhance the emotional reasoning capabilities of LALMs, attaining state-of-the-art results on both the MELD and IEMOCAP datasets, while cross-dataset experiments demonstrate strong generalization.
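The similarity-weighted reward idea can be sketched as follows. The label set and similarity values below are invented for illustration and are not taken from the paper; the point is only that a near-miss prediction (e.g., "frustrated" for "angry") earns partial credit instead of a hard 0, which smooths the reward landscape around ambiguous emotional boundaries.

```python
# Toy ESWR-style reward: partial credit for emotionally adjacent labels.
EMOTIONS = ["angry", "frustrated", "sad", "neutral", "happy"]

# Symmetric, hypothetical similarity scores between distinct emotions.
SIM = {
    ("angry", "frustrated"): 0.7,
    ("sad", "frustrated"): 0.4,
    ("neutral", "sad"): 0.2,
    ("neutral", "happy"): 0.3,
}

def emotion_similarity(a, b):
    """Look up similarity in either order; unrelated pairs score 0."""
    if a == b:
        return 1.0
    return SIM.get((a, b), SIM.get((b, a), 0.0))

def eswr_reward(pred, gold):
    """1.0 for an exact match, partial credit for a similar emotion,
    0.0 for a distant one -- a smoother signal than 0/1 accuracy for
    policy optimization."""
    return emotion_similarity(pred, gold)
```

In an RL loop, this reward would replace the binary correctness signal when scoring sampled model outputs before the group-relative policy update.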

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.
The International Conference on Neural Information Processing (ICONIP) is an annual conference of the Asia Pacific Neural Network Society (APNNS). ICONIP brings together attendees from around the world across diverse disciplines and professions, including researchers, academics, and industry experts, all working collaboratively to tackle real-world challenges and to contribute to society.

The 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025) is set to be a major event for researchers, practitioners, and enthusiasts in the field of natural language processing (NLP). Taking place from November 5th to 9th in Suzhou, China, this conference promises to showcase cutting-edge research, innovative applications, and thought-provoking discussions.

ICCV is hosted by the Institute of Electrical and Electronics Engineers (IEEE). It is a premier international computer vision event, and its proceedings reflect the latest trends and the highest level of research in the field. Highly regarded in the community, it has the lowest acceptance rate among the three major computer vision conferences.

The 21st International Conference on Information Security and Cryptology (Inscrypt 2025) will be held in Xi’an from October 19th to October 21st, 2025, organized by the State Key Laboratory of Integrated Services Networks (ISN) of Xidian University and the State Key Laboratory of Cyberspace Security Defense (SKLCSD) of the Institute of Information Engineering, Chinese Academy of Sciences. Inscrypt 2025 seeks high-quality research contributions in the form of well-developed papers. Topics of interest encompass research advances in all areas of information security, cryptology, and their applications. The conference proceedings will be published by Springer in the LNCS series.