With the rapid advancement of technologies like text-to-speech (TTS) and
voice conversion (VC), detecting deepfake voices has become increasingly
crucial. However, both academia and industry lack a comprehensive and intuitive
benchmark for evaluating detectors. Existing datasets are limited in language
diversity and lack many manipulations encountered in real-world production
environments.
To fill this gap, we propose VoiceWukong, a benchmark designed to evaluate
the performance of deepfake voice detectors. To build the dataset, we first
collected deepfake voices generated by 19 advanced and widely recognized
commercial tools and 15 open-source tools. We then created 38 data variants
covering six types of manipulations, constructing the evaluation dataset for
deepfake voice detection. VoiceWukong thus includes 265,200 English and 148,200
Chinese deepfake voice samples. Using VoiceWukong, we evaluated 12
state-of-the-art detectors. AASIST2 achieved the best equal error rate (EER) of
13.50%, while all others exceeded 20%. Our findings reveal that these detectors
face significant challenges in real-world applications, with dramatically
declining performance. In addition, we conducted a user study with more than
300 participants. We compared the results with the performance of the 12
detectors and a multimodal large language model (MLLM), Qwen2-Audio.
Detectors and humans exhibit varying identification capabilities for
deepfake voices at different deception levels, while the MLLM
demonstrates no detection ability at all. Furthermore, we provide a leaderboard
for deepfake voice detection, publicly available at
https://voicewukong.github.io.