Benchmarking Safety Risks of Knowledge-Intensive Reasoning under Malicious Knowledge Editing
Abstract
Large language models (LLMs) increasingly rely on knowledge editing to support knowledge-intensive reasoning, but this flexibility also introduces critical safety risks: adversaries can inject malicious or misleading knowledge that corrupts downstream reasoning and leads to harmful outcomes. Existing knowledge editing benchmarks primarily focus on editing efficacy and lack a unified framework for systematically evaluating the safety implications of edited knowledge on reasoning behavior. To address this gap, we present EditRisk-Bench, a benchmark for systematically evaluating safety risks of knowledge-intensive reasoning under malicious knowledge editing. Unlike prior benchmarks that mainly emphasize edit success, generalization, and locality, EditRisk-Bench focuses on how injected knowledge affects downstream reasoning behavior and reliability. It integrates diverse malicious scenarios, including misinformation, bias, and safety violations, together with multi-level knowledge-intensive reasoning tasks and representative editing strategies within a unified evaluation framework measuring attack effectiveness, reasoning correctness, and side effects. Extensive experiments on both open-source and closed-source LLMs show that malicious knowledge editing can reliably induce incorrect or unsafe reasoning while largely preserving general capabilities, making such risks difficult to detect. We further identify several key factors influencing these risks, including edit scale, knowledge characteristics, and reasoning complexity. EditRisk-Bench provides an extensible testbed for understanding and mitigating safety risks in knowledge editing for LLMs.
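To make the three evaluation axes mentioned above concrete, the sketch below shows one plausible way to score an edited model on attack effectiveness, reasoning correctness, and side effects (locality). It is a minimal illustration only: the `Case` fields, the exact-match scoring, and the function names are assumptions for exposition, not EditRisk-Bench's actual interface or metric definitions.

```python
# Minimal sketch of the three evaluation axes described in the abstract:
# attack effectiveness, reasoning correctness, and side effects (locality).
# All names and the exact-match scoring rule are illustrative assumptions,
# not the benchmark's actual API.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    question: str          # knowledge-intensive reasoning query
    malicious_answer: str  # answer implied by the injected (edited) knowledge
    correct_answer: str    # ground-truth answer before editing


def evaluate(model: Callable[[str], str],
             attack_cases: List[Case],
             locality_cases: List[Case]) -> dict:
    """Score an edited model on attack effectiveness, reasoning
    correctness, and side effects on unrelated knowledge."""
    # Attack effectiveness: how often the model reproduces the answer
    # implied by the maliciously edited knowledge.
    attacked = sum(model(c.question).strip() == c.malicious_answer
                   for c in attack_cases)
    # Reasoning correctness: how often the model still reaches the
    # pre-edit ground-truth answer on the attacked queries.
    correct = sum(model(c.question).strip() == c.correct_answer
                  for c in attack_cases)
    # Side effects (locality): accuracy on queries the edits should not touch.
    unaffected = sum(model(c.question).strip() == c.correct_answer
                     for c in locality_cases)
    return {
        "attack_effectiveness": attacked / len(attack_cases),
        "reasoning_correctness": correct / len(attack_cases),
        "locality": unaffected / len(locality_cases),
    }


# Example usage (hypothetical model wrapper):
# scores = evaluate(lambda q: edited_llm.generate(q), attack_set, locality_set)
```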