“Short-length” Adversarial Training Helps LLMs Defend “Long-length” Jailbreak Attacks: Theoretical and Empirical Evidence | Authors: Shaopeng Fu, Liang Ding, Di Wang | Published: 2025-02-06 | Tags: Prompt Injection, Large Language Model, Adversarial Training
LLM Safety Alignment is Divergence Estimation in Disguise | Authors: Rajdeep Haldar, Ziyi Wang, Qifan Song, Guang Lin, Yue Xing | Published: 2025-02-02 | Tags: Prompt Injection, Convergence Analysis, Large Language Model, Safety Alignment
A Survey on Responsible LLMs: Inherent Risk, Malicious Use, and Mitigation Strategy | Authors: Huandong Wang, Wenjie Fu, Yingzhou Tang, Zhilong Chen, Yuxi Huang, Jinghua Piao, Chen Gao, Fengli Xu, Tao Jiang, Yong Li | Published: 2025-01-16 | Tags: Survey Paper, Privacy Protection, Prompt Injection, Large Language Model
Tag&Tab: Pretraining Data Detection in Large Language Models Using Keyword-Based Membership Inference Attack | Authors: Sagiv Antebi, Edan Habler, Asaf Shabtai, Yuval Elovici | Published: 2025-01-14 | Tags: Cybersecurity, Privacy Protection, Large Language Model
Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards | Authors: Yangsibo Huang, Milad Nasr, Anastasios Angelopoulos, Nicholas Carlini, Wei-Lin Chiang, Christopher A. Choquette-Choo, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Ken Ziyu Liu, Ion Stoica, Florian Tramer, Chiyuan Zhang | Published: 2025-01-13 | Tags: Cybersecurity, Large Language Model, Attack Evaluation
SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage | Authors: Xiaoning Dong, Wenbo Hu, Wei Xu, Tianxing He | Published: 2024-12-19 | Updated: 2025-03-21 | Tags: Prompt Injection, Large Language Model, Adversarial Learning
Towards Action Hijacking of Large Language Model-based Agent | Authors: Yuyang Zhang, Kangjie Chen, Jiaxin Gao, Ronghao Cui, Run Wang, Lina Wang, Tianwei Zhang | Published: 2024-12-14 | Updated: 2025-06-12 | Tags: Performance Evaluation, Prompt Leaking, Large Language Model
“Moralized” Multi-Step Jailbreak Prompts: Black-Box Testing of Guardrails in Large Language Models for Verbal Attacks | Authors: Libo Wang | Published: 2024-11-23 | Updated: 2025-03-20 | Tags: Prompt Injection, Large Language Model
JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit | Authors: Zeqing He, Zhibo Wang, Zhixuan Chu, Huiyu Xu, Wenhui Zhang, Qinglong Wang, Rui Zheng | Published: 2024-11-17 | Updated: 2025-04-24 | Tags: Disabling Safety Mechanisms of LLM, Prompt Injection, Large Language Model
Attention Tracker: Detecting Prompt Injection Attacks in LLMs | Authors: Kuo-Han Hung, Ching-Yun Ko, Ambrish Rawat, I-Hsin Chung, Winston H. Hsu, Pin-Yu Chen | Published: 2024-11-01 | Updated: 2025-04-23 | Tags: Indirect Prompt Injection, Large Language Model, Attention Mechanism