The Lab of Large Audio Model (LLAM) is committed to creating innovative solutions that enhance privacy, security, and efficiency in decentralized and complex systems.

[01/02/2026] $\bullet$ We are pleased to announce that our paper, “Confusion-Aware In-Context Learning for Vision-Language Models in Robotic Manipulation,” has been accepted by CSCWD 2026. This work addresses a critical robustness issue in vision-language-model-based robotic manipulation, particularly the frequent failures caused by confusable objects. We propose a novel framework that explicitly localizes and analyzes sources of confusion and incorporates this information into the in-context prompts of VLMs, guiding them to attend to discriminative features.
[18/01/2026] $\bullet$ What a fantastic start to the new year! We are thrilled to announce that 7 submissions have been officially accepted to ICASSP 2026, a milestone that comes as we pivot our focus toward the frontiers of Embodied AI, Multi-agent Systems, and Multimodal Large Language Models. Our accepted works dive deep into the next generation of AI, ranging from personalized digital humans and robotic control to self-correcting VLA models. We are excited to head to Barcelona this May to present our research, visit the iconic Camp Nou, and reconnect with the global community! Accepted papers: MirrorTalk: Forging Personalized Avatars via Disentangled Style and Hierarchical Motion Control; CARE: Multi-Task Pretraining for Latent Continuous Action Representation in Robot Control; Triage: Hierarchical Visual Budgeting for Efficient Video Reasoning in Vision-Language Models; Head-Aware Visual Cropping: Enhancing Fine-Grained VQA with Attention-Guided Subimage; Attention-weighted Centered Kernel Alignment for Knowledge Distillation in Large Audio-Language Models Applied to Speech Emotion Recognition; From Knowing to Doing Precisely: A General Self-Correction and Termination Framework for VLA Models; and Mita: A Hierarchical Multi-Agent Collaboration Framework with Memory-Integrated and Task Allocation.
[08/11/2025] $\bullet$ We are thrilled to announce that our paper, “Vista: Scene-Aware Optimization for Streaming Video Question Answering under Post-Hoc Queries”, has been accepted to AAAI 2026! Vista addresses the unique challenges of streaming video QA, namely sequential frames and arbitrary query timing, by introducing (1) scene-aware segmentation that clusters frames into temporally and visually coherent units, (2) scene-aware compression that stores compact scene tokens in GPU memory while offloading full-resolution frames to CPU, and (3) scene-aware recall that selectively reintegrates relevant scenes at query time. Vista is model-agnostic, integrates with diverse vision–language backbones, and enables long-context, low-latency reasoning; experiments on StreamingBench show state-of-the-art results, establishing a strong baseline for real-world streaming video understanding. A minimal illustrative sketch of the scene-segmentation idea appears after the news items below.
[21/08/2025] $\bullet$ We are thrilled to announce that our paper, “EMO-RL: Emotion-Rule-Based Reinforcement Learning Enhanced Audio-Language Model for Generalized Speech Emotion Recognition”, has been accepted to EMNLP 2025! In this work, we tackle key challenges in speech emotion recognition (SER) faced by large audio-language models (LALMs), such as emotional ambiguity and limited reasoning in smaller model architectures. By integrating emotion-constrained group-relative policy optimization into pretrained LALMs, EMO-RL significantly enhances emotional reasoning and stability during training.
[26/07/2025] $\bullet$ Our latest paper, “Turbo-TTS: Enhancing Diffusion Model TTS with an Improved ODE Solver,” has been accepted for presentation at the prestigious International Conference on Neural Information Processing (ICONIP 2025)! In this work, we introduce a novel ODE solver that dramatically accelerates diffusion-based TTS models while preserving audio quality.
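As a rough illustration of the scene-aware segmentation step behind Vista (referenced in the news item above), the sketch below greedily groups consecutive frame embeddings into scenes whenever similarity to the running scene centroid drops; the threshold, the centroid rule, and the function name are assumptions for illustration, not Vista's actual implementation.

```python
import numpy as np

def segment_scenes(frame_embs: np.ndarray, sim_threshold: float = 0.85):
    """Greedily group consecutive frames into scenes: start a new scene when
    cosine similarity to the running scene centroid falls below a threshold.
    Illustrative only; the threshold and centroid rule are assumptions."""
    scenes, current = [], [0]
    centroid = frame_embs[0].astype(float).copy()
    for i in range(1, len(frame_embs)):
        f = frame_embs[i]
        sim = float(f @ centroid) / (np.linalg.norm(f) * np.linalg.norm(centroid) + 1e-8)
        if sim >= sim_threshold:
            current.append(i)
            centroid = frame_embs[current].mean(axis=0)  # update the running centroid
        else:
            scenes.append(current)                       # close the current scene
            current, centroid = [i], frame_embs[i].astype(float).copy()
    scenes.append(current)
    return scenes  # list of frame-index lists, one per temporally coherent scene
```

The per-scene frame indices produced this way are what a scene-aware compression step could then summarize into compact scene tokens.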
Research on Federated Large Models focuses on advancing privacy-preserving distributed learning frameworks that enable collaborative training of large-scale AI models across decentralized data sources. This direction integrates cutting-edge techniques in federated learning, differential privacy, and model compression to address challenges in data silos, communication efficiency, and heterogeneous system environments. Key applications include cross-institutional medical analysis, secure financial risk prediction, and edge-device personalized AI services, all while ensuring strict compliance with data governance regulations.
Research on Trusted Computing aims to build secure and verifiable computing systems through hardware-rooted security mechanisms, enclave-based confidential computing, and decentralized trust verification protocols. We focus on designing architectures that guarantee data integrity, execution traceability, and resistance to adversarial attacks across cloud-edge environments. Our innovations are applied to blockchain consensus optimization, privacy-preserving biometric authentication, and AI model provenance tracking, establishing trust foundations for next-generation mission-critical systems.
Research on Graph Computing explores efficient algorithms and systems for analyzing complex relational data at web-scale. We develop novel graph neural network architectures, dynamic subgraph mining techniques, and heterogeneous graph embedding methods to address challenges in billion-edge network processing, real-time knowledge graph reasoning, and multimodal graph representation learning. Applications span social network fraud detection, drug discovery through molecular interaction networks, and smart city traffic optimization systems.
Research on Large Audio Models aims to advance audio processing, generation, understanding, and multimodal modeling. This research encompasses a wide range of applications, including speech recognition, virtual assistants, music composition, audio synthesis, and more. Within this broad scope, key areas of focus include: Low-Resource TTS, Expressive TTS, Voice Conversion, Audio Captioning, Speech Security, and Music AI.

The emergence of Large Audio-Language Models (LALMs) has advanced Speech Emotion Recognition (SER), but their size limits deployment in resource-constrained environments. While Knowledge Distillation (KD) is effective for LALM compression, existing methods leave distillation of the cross-modal projection module (Projector) underexplored and often struggle with alignment due to differences in feature dimensions. We propose PL-Distill, a KD framework that combines Projector-Level Distillation (PDist) to align audio embeddings and Logits-Level Distillation (LDist) to align output logits. PDist introduces Attention-weighted Centered Kernel Alignment, a novel approach that highlights important time steps and bridges dimension mismatches. Meanwhile, LDist minimizes the Kullback-Leibler divergence between teacher and student logits from the audio and text modalities. On IEMOCAP, RAVDESS, and SAVEE, PL-Distill compresses an 8.4B-parameter teacher to a compact 1.1B-parameter student, consistently outperforming the teacher, state-of-the-art pretrained models, and other KD baselines across all metrics.
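To make the two loss terms concrete, below is a minimal PyTorch sketch pairing linear CKA on projector outputs with a temperature-scaled KL term on logits; the exact way attention weights enter the CKA computation, and the `alpha`/`tau` values, are assumptions for illustration rather than the paper's formulation.

```python
import torch
import torch.nn.functional as F

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between feature matrices of shape (T, d_t) and (T, d_s).
    CKA compares representations of different dimensionality, which is what
    lets it bridge the teacher/student feature-size mismatch."""
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (X.T @ Y).norm(p="fro") ** 2
    return hsic / ((X.T @ X).norm(p="fro") * (Y.T @ Y).norm(p="fro") + 1e-8)

def pl_distill_loss(t_emb, s_emb, t_logits, s_logits, attn, tau=2.0, alpha=0.5):
    """Sketch of the two terms: attention-weighted CKA on projector outputs
    (PDist) plus temperature-scaled KL divergence on logits (LDist).
    How the attention weights enter CKA, and alpha/tau, are assumptions."""
    w = attn.unsqueeze(-1)                    # (T, 1) per-time-step weights
    pdist = 1.0 - linear_cka(w * t_emb, w * s_emb)
    ldist = F.kl_div(
        F.log_softmax(s_logits / tau, dim=-1),
        F.softmax(t_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2
    return alpha * pdist + (1.0 - alpha) * ldist
```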

Recent advances in Vision-Language-Action (VLA) models have shown promise for robot control, but their dependence on action supervision limits scalability and generalization. To address this challenge, we introduce CARE, a novel framework designed to train VLA models for robotic task execution. Unlike existing methods that depend on action annotations during pretraining, CARE eliminates the need for explicit action labels by leveraging only video-text pairs. These weakly aligned data sources enable the model to learn continuous latent action representations through a newly designed multi-task pretraining objective. During fine-tuning, a small set of labeled data is used to train the action head for control. Experimental results across various simulation tasks demonstrate CARE’s superior success rate, semantic interpretability, and ability to avoid shortcut learning. These results underscore CARE’s scalability, interpretability, and effectiveness in robotic control with weak supervision.
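As a loose sketch of this two-stage structure, the snippet below separates a pretrained encoder that produces a continuous latent action from a small action head fitted on limited labeled data; the layer sizes, module names, and split shown are hypothetical and only meant to illustrate the idea, not CARE's architecture.

```python
import torch.nn as nn

class LatentActionModel(nn.Module):
    """Illustrative two-stage layout: an encoder that maps observation features
    to a continuous latent action, and a small action head fitted later on a
    limited labeled set. All sizes and module names here are assumptions."""
    def __init__(self, obs_dim: int = 768, latent_dim: int = 64, act_dim: int = 7):
        super().__init__()
        # Stage 1: pretrained from video-text pairs with the multi-task objective
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.GELU(), nn.Linear(256, latent_dim)
        )
        # Stage 2: trained on a small amount of action-labeled data
        self.action_head = nn.Linear(latent_dim, act_dim)

    def forward(self, obs_feat):
        z = self.encoder(obs_feat)   # continuous latent action representation
        return self.action_head(z)   # low-level control command
```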

While vision-language-action (VLA) models for embodied agents integrate perception, reasoning, and control, they remain constrained by two critical weaknesses: first, during grasping tasks, the action tokens generated by the language model often exhibit subtle spatial deviations from the target object, resulting in grasp failures; second, they lack the ability to reliably recognize task completion, which leads to redundant actions and frequent timeout errors. To address these challenges and enhance robustness, we propose a lightweight, training-free framework, VLA-SCT. This framework operates as a self-correcting control loop, combining data-driven action refinement with conditional logic for termination. Compared to baseline approaches, our method achieves consistent improvements across all datasets in the LIBERO benchmark, significantly increasing the success rate of fine manipulation tasks and ensuring accurate task completion. These gains promote the deployment of more reliable VLA agents in complex, unstructured environments.
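A minimal sketch of such a self-correct-and-terminate loop is shown below; `policy`, `refine`, `is_done`, and `env` are hypothetical stand-ins for the VLA model, the data-driven refinement step, the completion check, and the environment, and do not correspond to VLA-SCT's actual interfaces.

```python
def vla_sct_loop(policy, refine, is_done, env, max_steps: int = 200):
    """Minimal self-correct-and-terminate control loop. All callables here
    are hypothetical stand-ins, not the framework's real API."""
    obs = env.reset()
    for step in range(max_steps):
        action = policy(obs)             # raw action proposed by the VLA model
        action = refine(action, obs)     # correct subtle spatial deviations
        obs, reward, terminated = env.step(action)
        if terminated or is_done(obs):   # explicit completion check avoids timeouts
            return True, step + 1        # success and number of steps taken
    return False, max_steps              # gave up after the step budget
```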

Multimodal Large Language Models (MLLMs) show strong performance in Visual Question Answering (VQA) but remain limited in fine-grained reasoning due to low-resolution inputs and noisy attention aggregation. We propose \textbf{Head-Aware Visual Cropping (HAVC)}, a training-free method that improves visual grounding by leveraging a selectively refined subset of attention heads. HAVC first filters heads through an OCR-based diagnostic task, ensuring that only those with genuine grounding ability are retained. At inference, these heads are further refined using spatial entropy for stronger spatial concentration and gradient sensitivity for predictive contribution. The fused signals produce a reliable Visual Cropping Guidance Map, which highlights the most task-relevant region and guides the cropping of a subimage that is then provided to the MLLM together with the image-question pair. Extensive experiments on multiple fine-grained VQA benchmarks demonstrate that HAVC consistently outperforms state-of-the-art cropping strategies, achieving more precise localization and stronger visual grounding, and providing a simple yet effective strategy for enhancing precision in MLLMs.
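As a rough sketch of how focused attention heads could be fused into a cropping guidance map, the snippet below scores each pre-filtered head by the spatial entropy of its attention map and averages the most concentrated ones; the entropy-only selection rule, keep ratio, and normalization are illustrative assumptions, and HAVC additionally incorporates gradient sensitivity.

```python
import numpy as np

def spatial_entropy(attn_map: np.ndarray) -> float:
    """Entropy of a normalized 2-D attention map; lower means more focused."""
    p = attn_map / (attn_map.sum() + 1e-8)
    return float(-(p * np.log(p + 1e-8)).sum())

def guidance_map(head_maps: np.ndarray, keep_ratio: float = 0.25) -> np.ndarray:
    """Fuse the most spatially concentrated heads into a single guidance map.
    head_maps: (H, h, w) attention maps from heads that already passed the
    OCR-based filter. The entropy-only selection and keep_ratio are
    illustrative assumptions; HAVC also weighs heads by gradient sensitivity."""
    entropies = np.array([spatial_entropy(m) for m in head_maps])
    k = max(1, int(keep_ratio * len(head_maps)))
    keep = np.argsort(entropies)[:k]             # lowest-entropy (most focused) heads
    fused = head_maps[keep].mean(axis=0)
    return fused / (fused.max() + 1e-8)          # normalized cropping guidance map
```

The highest-valued region of the fused map is then what determines the subimage crop passed back to the MLLM alongside the original image-question pair.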

Design of complex artifacts and systems requires the cooperation of multidisciplinary design teams. The 2026 29th International Conference on Computer Supported Cooperative Work in Design (CSCWD 2026) provides a forum for researchers and practitioners involved in different but related domains to confront research results and discuss key problems. The scope of CSCWD 2026 includes the research and development of collaboration technologies and their applications to the design of processes, products, systems, and services in industries and societies. Collaboration technologies include theories, methods, mechanisms, protocols, software tools, platforms, and services that support communication, coordination and collaboration among people, software and hardware systems. Related fields of research include human-computer interaction, business process management, collaborative virtual environments, enterprise modeling, security and privacy, as well as social aspects and human factors related to collaboration and design.

Join the world’s largest and most comprehensive technical conference dedicated to signal processing, built on 50 years of innovation and research dissemination. ICASSP has the highest h-Index of any conference in the signal processing field.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.
The International Conference on Neural Information Processing (ICONIP) is an annual conference of the Asia Pacific Neural Network Society (APNNS). ICONIP brings together attendees from around the world and from diverse disciplines and professions, including researchers, academics, and industry experts, all working collaboratively to tackle real-world challenges and to contribute to society.

The 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025) is set to be a major event for researchers, practitioners, and enthusiasts in the field of natural language processing (NLP). Taking place from November 5th to 9th in Suzhou, China, this conference promises to showcase cutting-edge research, innovative applications, and thought-provoking discussions.