Retrieval-Augmented Audio Deepfake Detection

Figure: Overview of the RAG and RAD pipelines.

Abstract

With recent advances in speech synthesis, including text-to-speech (TTS) and voice conversion (VC) systems, enabling the generation of ultra-realistic audio deepfakes, there is growing concern about their potential misuse. However, most deepfake (DF) detection methods rely solely on the fuzzy knowledge learned by a single model, resulting in performance bottlenecks and transparency issues. Inspired by retrieval-augmented generation (RAG), we propose a retrieval-augmented detection (RAD) framework that augments test samples with similar retrieved samples for enhanced detection. We also extend the multi-fusion attentive classifier to integrate it with our proposed RAD framework. Extensive experiments show that the proposed RAD framework outperforms baseline methods, achieving state-of-the-art results on the ASVspoof 2021 DF set and competitive results on the 2019 and 2021 LA sets. Further sample analysis indicates that the retriever consistently returns samples mostly from the same speaker, with acoustic characteristics highly consistent with the query audio, thereby improving detection performance.
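To make the retrieve-then-fuse idea concrete, below is a minimal sketch of the RAD workflow in plain NumPy. It is an illustration under stated assumptions, not the paper's implementation: the embedding datastore, the cosine-similarity retriever, and the attention-style `fuse` step are hypothetical stand-ins for the paper's retriever and multi-fusion attentive classifier.

```python
# Minimal sketch of retrieval-augmented detection (RAD): a test embedding is
# augmented with its nearest neighbours from a datastore of training
# embeddings before classification. All names below are illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical datastore: embeddings of known bona-fide/spoofed training audio.
datastore = rng.standard_normal((1000, 256)).astype(np.float32)

def retrieve(query: np.ndarray, bank: np.ndarray, k: int = 5) -> np.ndarray:
    """Return the k datastore embeddings most cosine-similar to the query."""
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    scores = b @ q                      # cosine similarity to every entry
    topk = np.argsort(scores)[-k:]      # indices of the k best matches
    return bank[topk]

def fuse(query: np.ndarray, neighbours: np.ndarray) -> np.ndarray:
    """Attention-style fusion: weight neighbours by similarity to the query."""
    weights = neighbours @ query
    weights = np.exp(weights - weights.max())
    weights /= weights.sum()            # softmax over retrieved samples
    context = weights @ neighbours      # similarity-weighted neighbour mix
    return np.concatenate([query, context])  # augmented representation

query = rng.standard_normal(256).astype(np.float32)
augmented = fuse(query, retrieve(query, datastore))
# `augmented` (512-d here) would feed the downstream deepfake classifier.
print(augmented.shape)  # (512,)
```

In this reading, retrieval supplies explicit evidence (similar known samples) alongside the model's learned representation, which is what the abstract credits for both the performance gain and the improved transparency.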

Type: Publication
Publication: In Proceedings of the 2024 ACM International Conference on Multimedia Retrieval
Zuheng Kang
Researcher

Yayun He
Researcher

Botao Zhao
Researcher