VoiceWukong: Benchmarking Deepfake Voice Detection
Abstract
With the rapid advancement of technologies like text-to-speech (TTS) and voice conversion (VC), detecting deepfake voices has become increasingly crucial. However, both academia and industry lack a comprehensive and intuitive benchmark for evaluating detectors. Existing datasets are limited in language diversity and lack many manipulations encountered in real-world production environments. To fill this gap, we propose VoiceWukong, a benchmark designed to evaluate the performance of deepfake voice detectors. To build the dataset, we first collected deepfake voices generated by 19 advanced and widely recognized commercial tools and 15 open-source tools. We then created 38 data variants covering six types of manipulations, constructing the evaluation dataset for deepfake voice detection. VoiceWukong thus includes 265,200 English and 148,200 Chinese deepfake voice samples. Using VoiceWukong, we evaluated 12 state-of-the-art detectors. AASIST2 achieved the best equal error rate (EER) of 13.50%, while all others exceeded 20%. Our findings reveal that these detectors face significant challenges in real-world applications, with dramatically declining performance. In addition, we conducted a user study with more than 300 participants. The results are compared with the performance of the 12 detectors and a multimodal large language model (MLLM), i.e., Qwen2-Audio. Different detectors and humans exhibit varying identification capabilities for deepfake voices at different deception levels, while the MLLM demonstrates no detection ability at all. Furthermore, we provide a leaderboard for deepfake voice detection, publicly available at https://voicewukong.github.io.
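The abstract reports detector quality as an equal error rate (EER) without defining it. The EER is the operating point at which the false acceptance rate (deepfakes accepted as genuine) equals the false rejection rate (genuine voices rejected). The following is a minimal sketch of how such a value could be approximated from detector scores; it is not the benchmark's evaluation code. The function name `equal_error_rate`, the score convention (higher means more likely genuine), and the threshold sweep are all assumptions made for illustration.

```python
from typing import Sequence


def equal_error_rate(bonafide_scores: Sequence[float],
                     spoof_scores: Sequence[float]) -> float:
    """Approximate the EER by trying every observed score as a threshold.

    Assumption: higher scores mean "more likely genuine", and a sample is
    accepted as genuine when its score is >= the threshold.
    """
    if not bonafide_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")

    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    thresholds.append(thresholds[-1] + 1.0)  # a threshold that rejects everything

    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False acceptance rate: share of deepfakes accepted as genuine.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection rate: share of genuine voices rejected.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest,
            # and report their midpoint as the EER estimate.
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    genuine = [0.9, 0.8, 0.7, 0.4]   # hypothetical detector scores
    fake = [0.1, 0.2, 0.3, 0.6]
    print(f"EER = {equal_error_rate(genuine, fake):.2%}")
```

With these toy scores, one genuine sample and one deepfake fall on the wrong side of the best threshold, so both error rates are 1 in 4. A lower EER is better; an EER of 50% corresponds to chance-level discrimination.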