Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models
Authors: Miao Yu, Zhenhong Zhou, Moayad Aloqaily, Kun Wang, Biwei Huang, Stephen Wang, Yueming Jin, Qingsong Wen | Published: 2025-09-26 | Updated: 2025-09-30
Disabling Safety Mechanisms of LLM
Self-Attention Mechanism
Interpretability