JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit | Authors: Zeqing He, Zhibo Wang, Zhixuan Chu, Huiyu Xu, Wenhui Zhang, Qinglong Wang, Rui Zheng | Published: 2024-11-17 | Updated: 2025-04-24 | Tags: Disabling Safety Mechanisms of LLM, Prompt Injection, Large Language Model
What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks | Authors: Nathalie Kirch, Constantin Weisser, Severin Field, Helen Yannakoudakis, Stephen Casper | Published: 2024-11-02 | Updated: 2025-05-14 | Tags: Disabling Safety Mechanisms of LLM, Prompt Injection, Exploratory Attack
Jailbreaking and Mitigation of Vulnerabilities in Large Language Models | Authors: Benji Peng, Keyu Chen, Qian Niu, Ziqian Bi, Ming Liu, Pohsun Feng, Tianyang Wang, Lawrence K. Q. Yan, Yizhu Wen, Yichao Zhang, Caitlyn Heqi Yin | Published: 2024-10-20 | Updated: 2025-05-08 | Tags: LLM Security, Disabling Safety Mechanisms of LLM, Prompt Injection
Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs | Authors: Rui Pu, Chaozhuo Li, Rui Ha, Zejian Chen, Litian Zhang, Zheng Liu, Lirong Qiu, Zaisheng Ye | Published: 2024-10-18 | Updated: 2025-07-08 | Tags: Disabling Safety Mechanisms of LLM, Prompt Injection, Prompt Validation
Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method | Authors: Weichao Zhang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, Xueqi Cheng | Published: 2024-09-23 | Updated: 2025-05-21 | Tags: Disabling Safety Mechanisms of LLM, Model Performance Evaluation, Information Extraction