The Lab of Large Audio Model (LLAM) is committed to exploring and advancing the forefront and future of audio and sound technology, and building large audio models.
[16/05/2024] $\bullet$ It feels amazing to receive an acceptance notification from a top-tier conference on a weekday afternoon! Our latest research paper, “Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning,” a collaboration between Dr. Jianzong Wang’s team at Ping An Technology and Professor Tianyi Zhou’s team at the University of Maryland, has been accepted as a long paper at ACL 2024 (CCF Class A), which has an acceptance rate of less than 20%. This represents a significant breakthrough in instruction tuning for large models. For the first time, we have revealed that models of different scales perceive instruction difficulty consistently, and our superfiltering method speeds up the large model training process by more than 20 times. This achievement opens up new avenues for data filtering technology. We welcome citations from our peers! Research Highlights:
1. Weak-to-Strong Data Consistency: We discovered that small and large language models are highly consistent in how they perceive and evaluate the difficulty of instruction-tuning data. This finding is crucial for optimizing data filtering processes.
2. Efficient Superfiltering Strategy: We proposed the first superfiltering method that uses small models (e.g., GPT-2) to select data, significantly accelerating the fine-tuning process of large language models.
3. Effectiveness of Selected Training Data: Superfiltering is highly precise in selecting high-quality, information-rich data. Models trained on only 5% of the data, chosen by the filter, performed similarly to or even better than models trained on the entire dataset across multiple benchmarks.
The complete research results and code are publicly available on GitHub: https://github.com/tianyi-lab/Superfiltering. This is our second paper at a top NLP conference.
Our team’s collaboration with the University of Maryland has already resulted in a paper published at NAACL, addressing the innovative problem of how to automatically identify high-quality instruction data from datasets during large model training.
[09/05/2024] $\bullet$ The 2024 Twentieth International Conference on Intelligent Computing (ICIC 2024) is scheduled to take place from August 5 to 8, 2024, in Tianjin, China. In the recently released acceptance notifications, two of our latest papers have been selected for oral presentation: “RREH: Reconstruction Relations Embedded Hashing for Semi-Paired Cross-Modal Retrieval” and “Enhancing Emotion Prediction and Recognition in Conversation through Fine-Grained Emotional Cue Analysis and Cross-Modal Fusion”. We look forward to sharing this research with the Intelligent Computing community at ICIC 2024.
[02/05/2024] $\bullet$ Groundbreaking Research on Emotion Transfer TTS Model Accepted at APWeb 2024. The Asia Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data (APWeb-WAIM) aims to attract professionals from different communities related to Web and Big Data who share an interest in interdisciplinary research, so that they can exchange ideas, experience, and the underlying techniques and applications, including Web technologies, database systems, information management, software engineering, and big data. In the latest acceptance notification, our paper “RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis”, on an advanced Text-to-Speech (TTS) model, has been officially accepted by APWeb 2024. The paper introduces a novel emotion transfer TTS model that overcomes limitations of traditional speech synthesis with controllable emotion intensity.
[08/04/2024] $\bullet$ We are thrilled to announce that our team’s paper “Retrieval-Augmented Audio Deepfake Detection” has been accepted at the ICMR 2024 conference (CCF-B). This research addresses rising concerns about the misuse of hyper-realistic audio deepfakes made possible by recent advances in speech synthesis technology. Our proposed Retrieval-Augmented Detection (RAD) framework, inspired by the Retrieval-Augmented Generation (RAG) used in Large Language Models (LLMs), significantly enhances deepfake detection by augmenting test samples with highly similar retrieved samples. The integration of multi-fusion attentive classifiers further improves the performance of the entire framework. Extensive experiments demonstrate the superiority of RAD over baseline approaches, achieving state-of-the-art results on the ASVspoof 2021 DF dataset and competitive results on the 2019 and 2021 LA datasets. This acceptance underlines the importance of our research in combating audio deepfakes and offers a promising way to safeguard the authenticity and credibility of digital content. We look forward to sharing our findings and contributing to advances in this field at ICMR 2024.
[16/03/2024] $\bullet$ Ten Groundbreaking Papers Accepted from Our Team at IJCNN 2024. We are thrilled to announce that our team’s latest submissions to the International Joint Conference on Neural Networks (IJCNN) 2024 have been met with exceptional success, with a total of 10 papers accepted for presentation. IJCNN stands as the foremost international conference dedicated to the theory, analysis, and applications of neural networks. The accepted works span a diverse array of cutting-edge research topics, ranging from speech recognition and conversion to singing voice enhancement, 3D action recognition, extractive question answering, and federated learning. These papers represent the forefront of innovation in artificial intelligence and its practical applications. Here is a glimpse of the accepted papers: “Task-Agnostic Decision Transformer for Multi-Type Agent Control with Federated Split Training”; “QLSC: A Query Latent Semantic Calibrator for Robust Extractive Question Answering”; “PRENet: A Plane-Fit Redundancy Encoding Point Cloud Sequence Network for Real-Time 3D Action Recognition”; “MAIN-VC: Lightweight Speech Representation Disentanglement for One-Shot Voice Conversion”; “Learning Expressive Disentangled Speech Representations with Soft Speech Units and Adversarial Style Augmentation”; “EfficientASR: Speech Recognition Network Compression via Attention Redundancy and Chunk-Level FFN Optimization”; “Efficient Multi-Model Fusion with Adversarial Complementary Representation Learning”; “EAD-VC: Enhancing Speech Auto-Disentanglement for Voice Conversion with IFUB Estimator and Joint Text-Guided Consistent Learning”; “Enhancing Anomalous Sound Detection with Multi-Level Memory Bank”; and “CONTUNER: Singing Voice Beautifying with Pitch and Expressiveness Condition”. We are dedicated to providing detailed insights into our research, and we intend to release the final versions of these papers on arXiv soon.
This will allow for further discussion, collaboration, and exploration of the groundbreaking ideas presented in our work. We invite fellow researchers, practitioners, and enthusiasts to engage with us in exploring the frontier of neural networks and artificial intelligence. Your insights and feedback are invaluable as we collectively strive to push the boundaries of what is possible in this rapidly evolving field.
Research on Large Audio Models aims to advance the field of audio processing, generation, understanding, and multimodal processing, with the goal of enabling new and innovative applications in areas such as speech recognition, virtual assistants, music composition, audio synthesis, and more.
Research on high-quality audio, few-shot TTS, low resource TTS, and expressive TTS is mainly applied to scenarios such as speech interaction, information broadcasting, and text-to-speech reading, as well as in intelligent voice outbound calls and intelligent agents.
This research aims to transform the vocal characteristics of a speaker while preserving the linguistic content of their speech. It has various applications in speech processing, including speaker adaptation, voice disguise, and emotion transfer.
This research aims to address the security threats and vulnerabilities associated with speech data, speech recognition systems, and voice communication.
This research covers topics in music information retrieval, including song detection, singer identification, main melody extraction, and voice beautification.
Instruction tuning is critical to improving LLMs but usually suffers from low-quality and redundant data. Data filtering for instruction tuning has proved important in improving both the efficiency and performance of the tuning process, but it also incurs extra cost and computation because LLMs are involved in the filtering itself. To reduce the filtering cost, we study Superfiltering: can we use a smaller and weaker model to select data for finetuning a larger and stronger model? Despite the performance gap between weak and strong language models, we find that they are highly consistent in how they perceive instruction difficulty and in the data they select. This enables us to use a much smaller and more efficient model to filter the instruction data used to train a larger language model. Not only does this greatly speed up data filtering, but the LLM finetuned on the filtered data also achieves even better performance on standard benchmarks. Extensive experiments validate the efficacy and efficiency of our approach.
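The selection step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the released implementation: the difficulty score used here (perplexity of the response given the instruction, divided by perplexity of the response alone) is an assumed scoring rule, the per-token log-probabilities are hand-written toy values standing in for the output of a small model such as GPT-2, and the function names `mean_nll`, `difficulty_score`, and `select_top_fraction` are hypothetical.

```python
import math
from typing import List, Sequence


def mean_nll(token_logprobs: Sequence[float]) -> float:
    """Average negative log-likelihood over the tokens of a response."""
    return -sum(token_logprobs) / len(token_logprobs)


def difficulty_score(cond_logprobs: Sequence[float],
                     uncond_logprobs: Sequence[float]) -> float:
    """Assumed score: PPL(response | instruction) / PPL(response).

    A value near 1 means the instruction barely helps the weak model
    predict the response, so the sample is treated as more difficult.
    """
    return math.exp(mean_nll(cond_logprobs)) / math.exp(mean_nll(uncond_logprobs))


def select_top_fraction(scores: Sequence[float], fraction: float) -> List[int]:
    """Return the indices of the highest-scoring samples, in dataset order."""
    k = max(1, math.ceil(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy log-probabilities (hypothetical): (response given instruction, response alone).
toy_samples = [
    ([-0.1, -0.2], [-1.0, -1.2]),
    ([-0.9, -1.1], [-1.0, -1.0]),
    ([-0.5, -0.5], [-1.0, -1.0]),
    ([-0.3, -0.3], [-0.6, -0.6]),
]
toy_scores = [difficulty_score(c, u) for c, u in toy_samples]
kept = select_top_fraction(toy_scores, fraction=0.5)
print(kept)  # indices of the samples that would be passed on to the larger model
```

In practice the log-probabilities would come from a single forward pass of the small model per sample, which is where the cost saving over filtering with a large model comes from.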
With recent advances in speech synthesis including text-to-speech (TTS) and voice conversion (VC) systems enabling the generation of ultra-realistic audio deepfakes, there is growing concern about their potential misuse. However, most deepfake (DF) detection methods rely solely on the fuzzy knowledge learned by a single model, resulting in performance bottlenecks and transparency issues. Inspired by retrieval-augmented generation (RAG), we propose a retrieval-augmented detection (RAD) framework that augments test samples with similar retrieved samples for enhanced detection. We also extend the multi-fusion attentive classifier to integrate it with our proposed RAD framework. Extensive experiments show the superior performance of the proposed RAD framework over baseline methods, achieving state-of-the-art results on the ASVspoof 2021 DF set and competitive results on the 2019 and 2021 LA sets. Further sample analysis indicates that the retriever consistently retrieves samples mostly from the same speaker with acoustic characteristics highly consistent with the query audio, thereby improving detection performance.
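The retrieval step in the framework above can be illustrated with a small sketch. It assumes each audio clip has already been mapped to a fixed-length embedding vector and uses plain cosine similarity to fetch the nearest stored samples; the embeddings, the similarity measure, and the helper names `cosine`, `retrieve_top_k`, and `augment` are assumptions made for illustration. The actual feature extractor and the multi-fusion attentive classifier are not reproduced here.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query: Sequence[float],
                   bank: Sequence[Sequence[float]],
                   k: int) -> List[int]:
    """Indices of the k bank entries most similar to the query."""
    ranked = sorted(range(len(bank)),
                    key=lambda i: cosine(query, bank[i]),
                    reverse=True)
    return ranked[:k]


def augment(query: Sequence[float],
            bank: Sequence[Sequence[float]],
            k: int) -> List[List[float]]:
    """Stack the query with its k nearest neighbours for a downstream detector."""
    neighbours = retrieve_top_k(query, bank, k)
    return [list(query)] + [list(bank[i]) for i in neighbours]


# Hypothetical 2-D embeddings standing in for stored audio samples.
toy_bank = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
toy_query = [1.0, 0.0]
augmented = augment(toy_query, toy_bank, k=2)
print(augmented)
```

A detector would then classify the stacked input rather than the query alone, so that its decision can draw on the retrieved samples as well.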
The Asia Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data (APWeb-WAIM) aims to attract professionals from different communities related to Web and Big Data who share an interest in interdisciplinary research, so that they can exchange ideas, experience, and the underlying techniques and applications, including Web technologies, database systems, information management, software engineering, and big data. The 8th APWeb-WAIM Joint International Conference on Web and Big Data will be held in Jinhua, China, from August 30 to September 1, 2024.
The Association for Computational Linguistics (ACL), established in 1962, is one of the most influential and dynamic international academic organizations in the world. Its annual meeting, held every summer, is the premier conference in natural language processing (NLP) and computational linguistics, providing a platform for scholars to present papers and share the latest research findings. The association has members from over 60 countries and regions worldwide, representing the highest level of international research in computational linguistics.
The 2024 Twentieth International Conference on Intelligent Computing (ICIC 2024) will be held on August 5-8, 2024, in Tianjin, China. The conference will be financially supported by the Natural Science Foundation of China. The theme for this conference is Advanced Intelligent Computing Methodologies and Applications. Original contributions related to this theme are especially solicited, including theories, methodologies, and applications in science and technology. Topics covering industrial issues and applications, as well as academic research into intelligent computing, are welcome. The conference proceedings will be published by Springer Verlag in Lecture Notes in Computer Science (LNCS), Lecture Notes in Artificial Intelligence (LNAI), and Lecture Notes in Bioinformatics (LNBI). All submissions will be peer-reviewed by experts on the basis of originality, significance, and clarity, and only papers presenting novel research results or successful innovative applications will be accepted for publication.
IJCNN, the leading international conference on neural network theory, analysis, and applications, will take place from June 30 to July 5, 2024, in Yokohama, Japan. The International Joint Conference on Neural Networks (IJCNN) covers a wide range of topics in the field of neural networks, from biological neural networks to artificial neural computation. We are excited to present our 10 accepted papers, focusing on TTS and Federated Learning.
NAACL 2024 invites the submission of long and short papers featuring substantial, original, and unpublished research in all aspects of Computational Linguistics and Natural Language Processing. NAACL 2024 has a goal of a diverse technical program—in addition to traditional research results, papers may contribute negative findings, survey an area, announce the creation of a new resource, argue a position, report novel linguistic insights derived using existing computational techniques, and reproduce, or fail to reproduce, previous results.