Confusion-Aware In-Context-Learning for Vision-Language Models in Robotic Manipulation
TBD
Yayun He, Zuheng Kang, Botao Zhao, Zhouyin Wu, Junqing Peng, Jianzong Wang
Attention-weighted Centered Kernel Alignment for Knowledge Distillation in Large Audio-Language Models Applied to Speech Emotion Recognition
The emergence of Large Audio-Language Models (LALMs) has advanced Speech Emotion Recognition (SER), but their size limits deployment in …
Qingran Yang, Botao Zhao, Zuheng Kang, Xue Li, Yayun He, Chuhang Liu, Xulong Zhang, Xiaoyang Qu, Junqing Peng, Jianzong Wang
arXiv
CARE: Multi-Task Pretraining for Latent Continuous Action Representation in Robot Control
Recent advances in Vision-Language-Action (VLA) models have shown promise for robot control, but their dependence on action supervision …
Jiaqi Shi, Xulong Zhang, Xiaoyang Qu, Jianzong Wang
arXiv
From Knowing to Doing Precisely: A General Self-Correction and Termination Framework for VLA Models
While vision-language-action (VLA) models for embodied agents integrate perception, reasoning, and control, they remain constrained by …
Wentao Zhang, Aolan Sun, Wentao Mo, Xiaoyang Qu, Yuxin Zheng, Jianzong Wang
arXiv
Head-Aware Visual Cropping: Enhancing Fine-Grained VQA with Attention-Guided Subimage
Multimodal Large Language Models (MLLMs) show strong performance in Visual Question Answering (VQA) but remain limited in fine-grained …
Junfei Xie, Peng Pan, Xulong Zhang
arXiv
MirrorTalk: Forging Personalized Avatars via Disentangled Style and Hierarchical Motion Control
Synthesizing personalized talking faces that uphold and highlight a speaker’s unique style while maintaining lip-sync accuracy …
Renjie Lu, Xulong Zhang, Xiaoyang Qu, Jianzong Wang, Shangfei Wang
arXiv
Mita: A Hierarchical Multi-Agent Collaboration Framework with Memory-Integrated and Task Allocation
Recent advances in large language models (LLMs) have substantially accelerated the development of embodied agents. LLM-based …
Xiaojie Zhang, Jianhan Wu, Xiaoyang Qu, Jianzong Wang
arXiv
Triage: Hierarchical Visual Budgeting for Efficient Video Reasoning in Vision-Language Models
Vision-Language Models (VLMs) face significant computational challenges in video processing due to massive data redundancy, which …
Anmin Wang, Nan Zhang, Wei Tao, Xiaoyang Qu, Guokuan Li, Jiguang Wan, Jianzong Wang
arXiv
Vista: Scene-Aware Optimization for Streaming Video Question Answering under Post-Hoc Queries
TBD
Haocheng Lu, Nan Zhang, Wei Tao, Xiaoyang Qu, Guokuan Li, Jiguang Wan, Jianzong Wang
Turbo-TTS: Enhancing Diffusion Model TTS with an Improved ODE Solver
This paper introduces Turbo-TTS, a novel diffusion-based model for text-to-speech (TTS) synthesis. Diffusion models leverage stochastic …
Xulong Zhang, Jiashu Wang, Xiaoyang Qu, Hui Tian, Jianzong Wang
Springer