The Lab of Large Audio Model (LLAM) is committed to creating innovative solutions that enhance privacy, security, and efficiency in decentralized and complex systems.

[17/03/2026] $\bullet$ We are thrilled to announce that our paper, “VLA-InfoEntropy: A Training-Free Vision-Attention Information Entropy Approach for Vision-Language-Action Models Inference Acceleration and Success,” has been accepted to ICME 2026! This work introduces a novel training-free method that leverages image entropy to quantify visual token informativeness and attention entropy to capture semantic relevance, enabling dynamic inference acceleration for Vision-Language-Action models by reducing redundancy while maintaining critical spatial, semantic, and temporal cues. Extensive experiments demonstrate significant improvements in inference efficiency and performance over existing approaches.
[11/03/2026] $\bullet$ We are thrilled to share that our latest research paper, titled “From Inheritance to Saturation: Disentangling the Evolution of Visual Redundancy for Architecture-Aware MLLM Inference Acceleration,” has been accepted and will be presented at the upcoming 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026).
[01/02/2026] $\bullet$ We are pleased to announce that our paper, “Confusion-Aware In-Context Learning for Vision-Language Models in Robotic Manipulation,” has been accepted by CSCWD 2026. This work addresses a critical robustness issue in vision-language-model-based robotic manipulation, particularly the frequent failures caused by confusable objects. We propose a novel framework that explicitly localizes and analyzes sources of confusion and incorporates this information into the in-context prompts of VLMs, guiding them to attend to discriminative features.
[18/01/2026] $\bullet$ What a fantastic start to the new year! We are thrilled to announce that 7 submissions have been officially accepted to ICASSP 2026, a milestone that comes as we pivot our focus toward the frontiers of Embodied AI, Multi-Agent Systems, and Multimodal Large Language Models. Our accepted works dive deep into the next generation of AI, ranging from personalized digital humans and robotic control to self-correcting VLA models. We are excited to head to Barcelona this May to present our research, visit the iconic Camp Nou, and reconnect with the global community! Accepted papers: “MirrorTalk: Forging Personalized Avatars via Disentangled Style and Hierarchical Motion Control”; “CARE: Multi-Task Pretraining for Latent Continuous Action Representation in Robot Control”; “Triage: Hierarchical Visual Budgeting for Efficient Video Reasoning in Vision-Language Models”; “Head-Aware Visual Cropping: Enhancing Fine-Grained VQA with Attention-Guided Subimage”; “Attention-weighted Centered Kernel Alignment for Knowledge Distillation in Large Audio-Language Models Applied to Speech Emotion Recognition”; “From Knowing to Doing Precisely: A General Self-Correction and Termination Framework for VLA Models”; and “Mita: A Hierarchical Multi-Agent Collaboration Framework with Memory-Integrated and Task Allocation.”
[08/11/2025] $\bullet$ We are thrilled to announce that our paper, “Vista: Scene-Aware Optimization for Streaming Video Question Answering under Post-Hoc Queries”, has been accepted to AAAI 2026! Vista addresses the unique challenges of streaming video QA—sequential frames and arbitrary query timing—by introducing (1) scene-aware segmentation that clusters frames into temporally and visually coherent units, (2) scene-aware compression that stores compact scene tokens in GPU memory while offloading full-resolution frames to CPU, and (3) scene-aware recall that selectively reintegrates relevant scenes at query time. Vista is model-agnostic, integrates with diverse vision–language backbones, and enables long-context, low-latency reasoning; experiments on StreamingBench show state-of-the-art results, establishing a strong baseline for real-world streaming video understanding.
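To make the scene-aware segmentation step above concrete, here is a minimal sketch of one way such clustering could work: greedily grouping consecutive frames into a scene while they stay similar to the running scene centroid, and starting a new scene when similarity drops. All names and the threshold-based rule are illustrative assumptions, not Vista's actual algorithm.

```python
import numpy as np

def segment_scenes(frame_feats, sim_threshold=0.9):
    """Greedy temporal segmentation of a frame stream.

    frame_feats: per-frame feature vectors (e.g. from a vision encoder).
    A new scene starts when a frame's cosine similarity to the current
    scene's centroid falls below sim_threshold.
    Returns a list of scenes, each a list of frame indices.
    """
    feats = np.asarray(frame_feats, dtype=float)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)

    scenes = [[0]]
    centroid = feats[0].copy()  # unnormalized running sum of the scene
    for i in range(1, len(feats)):
        c = centroid / np.linalg.norm(centroid)
        if feats[i] @ c >= sim_threshold:
            scenes[-1].append(i)   # frame continues the current scene
            centroid += feats[i]
        else:
            scenes.append([i])     # visual break: open a new scene
            centroid = feats[i].copy()
    return scenes
```

Because segmentation is causal (each frame is assigned as it arrives), this style of clustering fits the streaming setting, where queries may arrive at arbitrary times after the frames have been seen.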
Research on Federated Large Models focuses on advancing privacy-preserving distributed learning frameworks that enable collaborative training of large-scale AI models across decentralized data sources. This direction integrates cutting-edge techniques in federated learning, differential privacy, and model compression to address challenges in data silos, communication efficiency, and heterogeneous system environments. Key applications include cross-institutional medical analysis, secure financial risk prediction, and edge-device personalized AI services, all while ensuring strict compliance with data governance regulations.
Research on Trusted Computing aims to build secure and verifiable computing systems through hardware-rooted security mechanisms, enclave-based confidential computing, and decentralized trust verification protocols. We focus on designing architectures that guarantee data integrity, execution traceability, and resistance to adversarial attacks across cloud-edge environments. Our innovations are applied to blockchain consensus optimization, privacy-preserving biometric authentication, and AI model provenance tracking, establishing trust foundations for next-generation mission-critical systems.
Research on Graph Computing explores efficient algorithms and systems for analyzing complex relational data at web scale. We develop novel graph neural network architectures, dynamic subgraph mining techniques, and heterogeneous graph embedding methods to address challenges in billion-edge network processing, real-time knowledge graph reasoning, and multimodal graph representation learning. Applications span social network fraud detection, drug discovery through molecular interaction networks, and smart city traffic optimization systems.
Research on Large Audio Models aims to advance audio processing, generation, understanding, and multimodal integration. This research encompasses a wide range of applications, including speech recognition, virtual assistants, music composition, audio synthesis, and more. Within this broad scope, key areas of focus include Low-Resource TTS, Expressive TTS, Voice Conversion, Audio Captioning, Speech Security, and Music AI.

Vision-language models (VLMs) have significantly improved the generalization capabilities of robotic manipulation. However, VLM-based systems often suffer from a lack of robustness, leading to unpredictable errors, particularly in scenarios involving confusable objects. Our preliminary analysis reveals that these failures are mainly caused by the shortcut learning problem inherent in VLMs, which limits their ability to accurately distinguish between confusable features. To address this, we propose Confusion-Aware In-Context Learning (CAICL), a method that enhances VLM performance in confusable scenarios for robotic manipulation. The approach begins with confusion localization and analysis, identifying potential sources of confusion. This information is then used as a prompt for the VLM to focus on features most likely to cause misidentification. Extensive experiments on VIMA-Bench show that CAICL effectively addresses the shortcut learning issue, achieving an 85.5% success rate and demonstrating good stability across tasks with different degrees of generalization.
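The "confusion localization → in-context prompt" flow described above can be sketched as a simple prompt builder: given the output of a confusion-analysis step (which objects are confusable, and which features tell them apart), format that information into the prompt the VLM receives. The function name, input schema, and prompt wording below are hypothetical illustrations, not CAICL's actual implementation.

```python
def build_confusion_prompt(task, confusions):
    """Format confusion-analysis results into an in-context prompt.

    task: natural-language manipulation instruction.
    confusions: maps a target object to a dict with the object it is
        confused with and the features that discriminate between them.
    """
    lines = [f"Task: {task}", "Known sources of confusion:"]
    for target, info in confusions.items():
        lines.append(
            f"- '{target}' is easily confused with '{info['confused_with']}'; "
            f"attend to: {', '.join(info['discriminative_features'])}."
        )
    lines.append("Identify each object using the discriminative features above.")
    return "\n".join(lines)

prompt = build_confusion_prompt(
    "pick up the red mug",
    {"red mug": {"confused_with": "red bowl",
                 "discriminative_features": ["handle", "taller shape"]}},
)
```

The point of structuring the prompt this way is to counteract shortcut learning: instead of letting the model rely on a dominant cue shared by both objects (here, the color red), the prompt steers attention toward the features that actually discriminate them.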

Vision-Language Models (VLMs) face significant computational challenges in video processing due to massive data redundancy, which creates prohibitively long token sequences. To address this, we introduce Triage, a training-free, plug-and-play framework that reframes video reasoning as a resource allocation problem via hierarchical visual budgeting. Its first stage, Frame-Level Budgeting, identifies keyframes by evaluating their visual dynamics and relevance, generating a strategic prior based on their importance scores. Guided by this prior, the second stage, Token-Level Budgeting, allocates tokens in two phases: it first secures high-relevance Core Tokens, then selects diverse Context Tokens with an efficient batched Maximal Marginal Relevance (MMR) algorithm. Extensive experiments demonstrate that Triage improves inference speed and reduces memory footprint, while maintaining or surpassing the performance of baselines and other methods on various video reasoning benchmarks.
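For readers unfamiliar with MMR, the Context Token selection described above can be illustrated with the classic greedy formulation: each step picks the candidate that maximizes a trade-off between its relevance and its maximum similarity to tokens already selected, so the chosen set is both relevant and diverse. This is a generic MMR sketch under assumed inputs (per-token features and relevance scores), not Triage's batched implementation.

```python
import numpy as np

def mmr_select(token_feats, relevance, k, lam=0.7):
    """Greedy Maximal Marginal Relevance selection of k tokens.

    token_feats: per-token feature vectors.
    relevance: per-token relevance score (e.g. attention to the query).
    lam: trade-off weight; higher favors relevance, lower favors diversity.
    Returns selected token indices in pick order.
    """
    feats = np.asarray(token_feats, dtype=float)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    rel = np.asarray(relevance, dtype=float)

    selected = [int(np.argmax(rel))]          # seed with most relevant token
    candidates = set(range(len(rel))) - set(selected)
    max_sim = feats @ feats[selected[0]]      # max similarity to selected set

    while len(selected) < k and candidates:
        idx = np.array(sorted(candidates))
        # MMR score: relevance minus redundancy with what is already chosen
        mmr = lam * rel[idx] - (1 - lam) * max_sim[idx]
        pick = int(idx[np.argmax(mmr)])
        selected.append(pick)
        candidates.remove(pick)
        max_sim = np.maximum(max_sim, feats @ feats[pick])
    return selected
```

Keeping a running `max_sim` vector avoids recomputing similarities against the whole selected set each iteration, which is the kind of reuse that makes a batched variant efficient over long token sequences.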

ICME 2026 will bring together leading researchers and practitioners to share the latest developments and advances in the discipline. Featuring high-quality oral and poster sessions, world-class keynotes, exhibitions, demonstrations, and tutorials, the conference will attract leading researchers and global industry figures, providing excellent networking opportunities. In addition, exceptional papers and contributors will be selected and recognized with prestigious awards.

The Association for Computational Linguistics (ACL), established in 1962, is the premier professional society in the field of natural language processing (NLP) and computational linguistics, and one of the most influential and dynamic international academic organizations in the world. It holds an annual conference every summer, providing a platform for scholars to present papers and share the latest research findings. The association boasts members from over 60 countries and regions worldwide, and its annual meeting represents the highest level of international research in computational linguistics and NLP.

Design of complex artifacts and systems requires the cooperation of multidisciplinary design teams. The 2026 29th International Conference on Computer Supported Cooperative Work in Design (CSCWD 2026) provides a forum for researchers and practitioners involved in different but related domains to confront research results and discuss key problems. The scope of CSCWD 2026 includes the research and development of collaboration technologies and their applications to the design of processes, products, systems, and services in industries and societies. Collaboration technologies include theories, methods, mechanisms, protocols, software tools, platforms, and services that support communication, coordination and collaboration among people, software and hardware systems. Related fields of research include human-computer interaction, business process management, collaborative virtual environments, enterprise modeling, security and privacy, as well as social aspects and human factors related to collaboration and design.

Join the world’s largest and most comprehensive technical conference dedicated to signal processing, built on 50 years of innovation and research dissemination. ICASSP has the highest h-Index of any conference in the signal processing field.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.