AI Security Portal K Program
Defusing the Trigger: Plug-and-Play Defense for Backdoored LLMs via Tail-Risk Intrinsic Geometric Smoothing
Abstract
Defending against backdoor attacks in large language models remains a critical practical challenge. Existing defenses mitigate these threats but typically incur high preparation costs and degrade utility via offline purification, or introduce severe latency via complex online interventions. To overcome this dichotomy, we present Tail-risk Intrinsic Geometric Smoothing (TIGS), a plug-and-play inference-time defense requiring no parameter updates, external clean data, or auxiliary generation. TIGS leverages the observation that successful backdoor triggers consistently induce localized attention collapse within the semantic content region. Operating entirely within the native forward pass, TIGS first performs content-aware tail-risk screening to identify suspicious attention heads and rows using sample-internal signals. It then applies intrinsic geometric smoothing: a weak content-domain correction preserves semantic anchoring, while a stronger full-row contraction disrupts trigger-dominant routing. Finally, a controlled full-row write-back reconstructs the attention matrix to ensure inference stability. Extensive evaluations demonstrate that TIGS substantially suppresses attack success rates while strictly preserving clean reasoning and open-ended semantic consistency. Crucially, this favorable security-utility-latency equilibrium persists across diverse architectures, including dense, reasoning-oriented, and sparse mixture-of-experts models. By structurally disrupting adversarial routing with marginal latency overhead, TIGS establishes a highly practical, deployment-ready defense standard for state-of-the-art LLMs.
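The three stages named in the abstract (screening, smoothing, write-back) can be illustrated with a toy sketch on a single causal attention head. This is not the authors' implementation: the abstract gives no formulas, so the function names (`screen_rows`, `smooth_row`, `defuse`), the mean-plus-`k`-standard-deviations cutoff used as a stand-in for tail-risk screening, and the blend weights `weak` and `strong` are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def row_peak(row, content_idx):
    """Largest weight a row places on any single content token."""
    return max((row[j] for j in content_idx), default=0.0)


def screen_rows(attn, content_idx, k=1.0, floor=0.5):
    """Flag rows whose content peak sits in the upper tail of this sample's rows.

    Hypothetical stand-in for tail-risk screening: the cutoff is computed from
    the sample itself (mean + k * std of row peaks), with an absolute floor.
    """
    peaks = [row_peak(r, content_idx) for r in attn]
    cutoff = max(floor, mean(peaks) + k * pstdev(peaks))
    return [i for i, p in enumerate(peaks) if p > cutoff]


def smooth_row(row, visible, content_idx, weak=0.1, strong=0.5):
    """Apply a weak content-domain blend, then a stronger full-row contraction."""
    out = list(row)
    content = [j for j in content_idx if j in visible]
    if content:
        # Weak correction: pull content weights toward their mean.
        # The total mass on the content region is unchanged by this step.
        avg = sum(out[j] for j in content) / len(content)
        for j in content:
            out[j] = (1 - weak) * out[j] + weak * avg
    # Stronger contraction: pull the whole visible row toward uniform.
    uniform = 1.0 / len(visible)
    for j in visible:
        out[j] = (1 - strong) * out[j] + strong * uniform
    return out


def defuse(attn, content_idx):
    """Screen, smooth, and write back flagged rows of a causal attention matrix."""
    flagged = screen_rows(attn, content_idx)
    out = [list(r) for r in attn]
    for i in flagged:
        visible = range(i + 1)  # causal mask: row i sees keys 0..i
        new = smooth_row(attn[i], visible, content_idx)
        total = sum(new)
        out[i] = [w / total for w in new]  # write-back keeps the row normalized
    return out, flagged


# Toy example: tokens 1 and 2 are content; the last row collapses onto token 2.
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.3, 0.3, 0.4, 0.0],
    [0.02, 0.03, 0.9, 0.05],
]
smoothed, flagged = defuse(attn, content_idx=[1, 2])
```

In this toy input only the last row is flagged, its peak on token 2 is reduced while the row still sums to one, and unflagged rows pass through unchanged. A real system would operate on tensors inside the model's forward pass and per head; the list-based form here is only meant to make the data flow visible.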