Benchmarking Safety Risks of Knowledge-Intensive Reasoning under Malicious Knowledge Editing
Abstract
Large language models (LLMs) increasingly rely on knowledge editing to support knowledge-intensive reasoning, but this flexibility also introduces critical safety risks: adversaries can inject malicious or misleading knowledge that corrupts downstream reasoning and leads to harmful outcomes. Existing knowledge editing benchmarks primarily focus on editing efficacy and lack a unified framework for systematically evaluating the safety implications of edited knowledge on reasoning behavior. To address this gap, we present EditRisk-Bench, a benchmark for systematically evaluating safety risks of knowledge-intensive reasoning under malicious knowledge editing. Unlike prior benchmarks that mainly emphasize edit success, generalization, and locality, EditRisk-Bench focuses on how injected knowledge affects downstream reasoning behavior and reliability. It integrates diverse malicious scenarios, including misinformation, bias, and safety violations, together with multi-level knowledge-intensive reasoning tasks and representative editing strategies within a unified evaluation framework measuring attack effectiveness, reasoning correctness, and side effects. Extensive experiments on both open-source and closed-source LLMs show that malicious knowledge editing can reliably induce incorrect or unsafe reasoning while largely preserving general capabilities, making such risks difficult to detect. We further identify several key factors influencing these risks, including edit scale, knowledge characteristics, and reasoning complexity. EditRisk-Bench provides an extensible testbed for understanding and mitigating safety risks in knowledge editing for LLMs.
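To make the three evaluation axes mentioned above concrete, the sketch below shows one plausible way to score an edited model on attack effectiveness, reasoning correctness, and side effects (locality). It is a minimal illustration only: the `Case` fields, the exact-match scoring, and the function names are assumptions for exposition, not EditRisk-Bench's actual interface or metric definitions.

```python
# Minimal sketch of the three evaluation axes described in the abstract:
# attack effectiveness, reasoning correctness, and side effects (locality).
# All names and the exact-match scoring rule are illustrative assumptions,
# not the benchmark's actual API.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    question: str          # knowledge-intensive reasoning query
    malicious_answer: str  # answer implied by the injected (edited) knowledge
    correct_answer: str    # ground-truth answer before editing


def evaluate(model: Callable[[str], str],
             attack_cases: List[Case],
             locality_cases: List[Case]) -> dict:
    """Score an edited model on attack effectiveness, reasoning
    correctness, and side effects on unrelated knowledge."""
    # Attack effectiveness: how often the model reproduces the answer
    # implied by the maliciously edited knowledge.
    attacked = sum(model(c.question).strip() == c.malicious_answer
                   for c in attack_cases)
    # Reasoning correctness: how often the model still reaches the
    # pre-edit ground-truth answer on the attacked queries.
    correct = sum(model(c.question).strip() == c.correct_answer
                  for c in attack_cases)
    # Side effects (locality): accuracy on queries the edits should not touch.
    unaffected = sum(model(c.question).strip() == c.correct_answer
                     for c in locality_cases)
    return {
        "attack_effectiveness": attacked / len(attack_cases),
        "reasoning_correctness": correct / len(attack_cases),
        "locality": unaffected / len(locality_cases),
    }


# Example usage (hypothetical model wrapper):
# scores = evaluate(lambda q: edited_llm.generate(q), attack_set, locality_set)
```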