The Lab of Large Audio Model (LLAM) is committed to exploring and advancing the forefront and future of audio and sound technology, and building large audio models.
[01/02/2024] $\bullet$ Great news! We are excited to announce that our latest research submission to CSCWD 2024 has been accepted. The 2024 27th International Conference on Computer Supported Cooperative Work in Design (CSCWD 2024) serves as a platform for researchers and practitioners across diverse domains to present their findings and discuss key problems. The conference’s scope encompasses the research and development of collaboration technologies and their applications to the design of processes, products, systems, and services across various industries and societies. The accepted work is “Medical Speech Symptoms Classification via Disentangled Representation.” This contribution reflects our commitment to advancing collaboration technologies, exploring innovative methods, and addressing key challenges in fields such as human-computer interaction, business process management, collaborative virtual environments, enterprise modeling, security and privacy, as well as the social aspects and human factors associated with collaboration and design. We look forward to participating in CSCWD 2024 and contributing to the discussions and advancements in the field of computer-supported cooperative work in design.
[24/01/2024] $\bullet$ Exciting News: Our Paper on Hierarchical Federated Framework for Audio Model Generation Technology Accepted by CAAI Transactions on Intelligent Systems. We are thrilled to announce that our research paper, titled “Research on Audio Model Generation Technology Based on Hierarchical Federated Framework,” has been accepted for publication in the prestigious journal, CAAI Transactions on Intelligent Systems. The journal is currently in the process of scheduling the publication date for our groundbreaking work. The focal point of our study centers around audio models, delving into the exploration of next-generation audio generation techniques. The primary objective is to construct a federated audio model training framework that facilitates audio representation learning on a massively scaled audio dataset. This framework aims to provide efficient and robust solutions for various downstream audio tasks. We eagerly anticipate the publication of our paper in CAAI Transactions on Intelligent Systems and look forward to sharing our findings with the broader scientific community.
[13/12/2023] $\bullet$ Breaking news: We are delighted to announce that our team has six papers accepted by ICASSP 2024, according to a preliminary list of accepted papers. ICASSP is the top conference in the field of speech and signal processing, and we congratulate our team for their outstanding achievements at ICASSP. For more details, please refer to the official acceptance notification.
[10/12/2023] $\bullet$ Jianzong Wang, the Honorary Director of the Laboratory, has been awarded the Outstanding Reviewer Award at the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023). This prestigious award recognizes his excellent contributions to the community in providing high-quality and efficient reviews of competitive paper and symposium submissions for the conference program. EMNLP 2023 is one of the leading conferences in the field of natural language processing, attracting researchers from all over the world to present and discuss their latest findings and innovations. The Outstanding Reviewer Award is given to reviewers who have demonstrated the highest standards of rigor, relevance, and constructive feedback in their reviews. Jianzong Wang is among the few selected reviewers to receive this honor, which reflects his expertise, dedication, and professionalism in advancing scientific communication. We congratulate Jianzong Wang on this remarkable achievement and thank him for his valuable service to the community.
[01/12/2023] $\bullet$ We are thrilled to share the fantastic news that our latest paper, titled “Gecko: Resource-Efficient and Accurate Queries in Real-Time Video Streams at the Edge,” has been successfully accepted for inclusion in the technical program of the prestigious IEEE INFOCOM 2024 conference. This achievement not only underscores the dedication and hard work invested in our research but also highlights the significance of our findings in the realm of real-time video stream analysis at the Edge. The acceptance rate for this conference stands at an impressive 19%, further emphasizing the caliber and innovation encapsulated in our work. We extend our heartfelt gratitude to everyone involved in the development of this paper and look forward to the opportunity to present and share our insights with the global community of researchers and professionals at IEEE INFOCOM 2024.
Research on Large Audio Models aims to advance the field of audio processing, generation, understanding, and multimodal processing, with the goal of enabling new and innovative applications in areas such as speech recognition, virtual assistants, music composition, audio synthesis, and more.
Research on high-quality audio, few-shot TTS, low resource TTS, and expressive TTS is mainly applied to scenarios such as speech interaction, information broadcasting, and text-to-speech reading, as well as in intelligent voice outbound calls and intelligent agents.
Research aims to transform the vocal characteristics of a speaker while preserving the linguistic content of their speech. This has various applications in speech processing, including speaker adaptation, voice disguise, and emotion transfer.
Research aims to address various security threats and vulnerabilities associated with speech data, speech recognition systems, and voice communication.
Research topics related to music information retrieval, including song detection, singer identification, main melody extraction, and voice beautification.
Surveillance cameras are ubiquitous nowadays, and users’ increasing need to access real-world information (e.g., finding abandoned luggage) has driven demand for object queries in real-time videos. While recent real-time video query processing systems exhibit excellent performance, they lack utility in practical deployment because they overlook some crucial aspects, including multi-camera exploration, resource contention, and content awareness. Motivated by these issues, we propose Gecko, a framework that provides resource-efficient and accurate real-time object queries of massive videos on edge devices. Gecko (i) obtains optimal models from the model zoo and assigns them to edge devices for executing current queries, (ii) optimizes resource usage of the edge cluster at runtime by dynamically adjusting the frame query interval of each video stream and forking/joining running models on edge devices, and (iii) improves accuracy in changing video scenes by fine-grained stream transfer and continuous learning of models. Our evaluation with real-world video streams and queries shows that Gecko achieves up to 2x more resource efficiency gains and increases overall query accuracy by at least 12% compared with prior work, further delivering excellent scalability for practical deployment.
In recent years, the field of talking face generation has attracted considerable attention, with certain methods adept at generating virtual faces that convincingly imitate human expressions. However, existing methods face challenges related to limited generalization, particularly when dealing with challenging identities. Furthermore, methods for editing expressions are often confined to a single emotion, failing to adapt to intricate emotions. To overcome these challenges, this paper proposes EmoTalker, an emotionally editable portrait animation approach based on the diffusion model. EmoTalker modifies the denoising process to ensure preservation of the original portrait’s identity during inference. To enhance emotion comprehension from text input, an Emotion Intensity Block is introduced to analyze fine-grained emotions and strengths derived from prompts. Additionally, a crafted dataset is harnessed to enhance emotion comprehension within prompts. Experiments show the effectiveness of EmoTalker in generating high-quality, emotionally customizable facial expressions.
Existing emotional speech synthesis methods often utilize an utterance-level style embedding extracted from reference audio, neglecting the inherent multi-scale property of speech prosody. We introduce ED-TTS, a multi-scale emotional speech synthesis model that leverages Speech Emotion Diarization (SED) and Speech Emotion Recognition (SER) to model emotions at different levels. Specifically, our proposed approach integrates the utterance-level emotion embedding extracted by SER with fine-grained frame-level emotion embedding obtained from SED. These embeddings are used to condition the reverse process of the denoising diffusion probabilistic model (DDPM). Additionally, we employ cross-domain SED to accurately predict soft labels, addressing the challenge of a scarcity of fine-grained emotion-annotated datasets for supervising emotional TTS training.
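As a rough illustration of the multi-scale conditioning idea described for ED-TTS, the sketch below broadcasts one utterance-level embedding across all frames and concatenates it with each frame-level embedding. This is only a minimal sketch under stated assumptions: the function name `build_condition`, the embedding sizes, the example values, and the choice of concatenation (rather than, say, addition) are all hypothetical and are not taken from the paper.

```python
# Hypothetical sketch: names, shapes, values, and the use of concatenation
# are assumptions for illustration, not the ED-TTS implementation.
from typing import List

Vector = List[float]


def build_condition(utterance_emb: Vector, frame_embs: List[Vector]) -> List[Vector]:
    """Repeat the utterance-level embedding for every frame and
    concatenate it with that frame's fine-grained embedding."""
    return [utterance_emb + frame_emb for frame_emb in frame_embs]


# Made-up utterance-level emotion embedding (e.g. what an SER model might produce).
utterance_emb = [0.9, 0.1]

# Made-up frame-level emotion embeddings (e.g. what an SED model might produce),
# one vector per frame.
frame_embs = [
    [0.2, 0.8, 0.0],
    [0.1, 0.7, 0.2],
    [0.0, 0.3, 0.7],
]

condition = build_condition(utterance_emb, frame_embs)
print(len(condition), len(condition[0]))  # 3 frames, 5 values per frame
```

In a diffusion-based synthesizer, a per-frame conditioning sequence of this kind would typically be passed to the denoising network at each reverse step; how ED-TTS actually injects it is not specified in the abstract.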
Voice conversion refers to transferring speaker identity with well-preserved content. Better disentanglement of speech representations leads to better voice conversion. Recent studies have found that phonetic information from input audio can represent content well. In addition, speaker-style modeling with pre-trained models makes the process more complex. To tackle these issues, we introduce a new method named “CTVC” which utilizes disentangled speech representations with contrastive learning and time-invariant retrieval. Specifically, a similarity-based compression module is used to build a closer connection between the frame-level hidden features and phoneme-level linguistic information. Additionally, a time-invariant retrieval is proposed for timbre extraction based on multiple segmentations and mutual information. Experimental results demonstrate that “CTVC” outperforms previous studies and improves the sound quality and similarity of converted results.
IEEE INFOCOM is a top-ranked conference on networking in the research community. It is a major conference venue for researchers to present and exchange significant and innovative contributions and ideas in the field of networking and closely related areas. IEEE INFOCOM covers both theoretical and systems research. IEEE INFOCOM 2024 is scheduled to take place at the stunning Hyatt Regency hotel in the vibrant city of Vancouver, Canada. We have had 1 paper accepted.
Design of complex artifacts and systems requires the cooperation of multidisciplinary design teams. The 2024 27th International Conference on Computer Supported Cooperative Work in Design (CSCWD 2024) provides a forum for researchers and practitioners involved in different but related domains to confront research results and discuss key problems. The scope of CSCWD 2024 includes the research and development of collaboration technologies and their applications to the design of processes, products, systems, and services in industries and societies. Collaboration technologies include theories, methods, mechanisms, protocols, software tools, platforms, and services that support communication, coordination and collaboration among people, software and hardware systems. Related fields of research include human-computer interaction, business process management, collaborative virtual environments, enterprise modeling, security and privacy, as well as social aspects and human factors related to collaboration and design.
ICASSP is the world’s largest and most comprehensive technical conference focused on signal processing and its applications. It offers a comprehensive technical program presenting all the latest developments in research and technology in the industry and attracts thousands of professionals annually. The 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024), held in Seoul, Korea, 14-19 April 2024, is hosted by the IEEE Signal Processing Society. We have had 6 papers accepted.
The 27th DATE conference is the main European event bringing together designers and design automation users, researchers and vendors as well as specialists in hardware and software design, test and manufacturing of electronic circuits and systems. DATE puts strong emphasis on both technology and systems, covering ICs/SoCs, emerging technologies, embedded systems and embedded software. We have had 1 paper accepted.
Take a look at workplaces in our lab…