Large Language Model

“Short-length” Adversarial Training Helps LLMs Defend “Long-length” Jailbreak Attacks: Theoretical and Empirical Evidence

Authors: Shaopeng Fu, Liang Ding, Di Wang | Published: 2025-02-06
Prompt Injection
Large Language Model
Adversarial Training

LLM Safety Alignment is Divergence Estimation in Disguise

Authors: Rajdeep Haldar, Ziyi Wang, Qifan Song, Guang Lin, Yue Xing | Published: 2025-02-02
Prompt Injection
Convergence Analysis
Large Language Model
Safety Alignment

A Survey on Responsible LLMs: Inherent Risk, Malicious Use, and Mitigation Strategy

Authors: Huandong Wang, Wenjie Fu, Yingzhou Tang, Zhilong Chen, Yuxi Huang, Jinghua Piao, Chen Gao, Fengli Xu, Tao Jiang, Yong Li | Published: 2025-01-16
Survey Paper
Privacy Protection
Prompt Injection
Large Language Model

Tag&Tab: Pretraining Data Detection in Large Language Models Using Keyword-Based Membership Inference Attack

Authors: Sagiv Antebi, Edan Habler, Asaf Shabtai, Yuval Elovici | Published: 2025-01-14
Cybersecurity
Privacy Protection
Large Language Model

Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards

Authors: Yangsibo Huang, Milad Nasr, Anastasios Angelopoulos, Nicholas Carlini, Wei-Lin Chiang, Christopher A. Choquette-Choo, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Ken Ziyu Liu, Ion Stoica, Florian Tramer, Chiyuan Zhang | Published: 2025-01-13
Cybersecurity
Large Language Model
Attack Evaluation

SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage

Authors: Xiaoning Dong, Wenbo Hu, Wei Xu, Tianxing He | Published: 2024-12-19 | Updated: 2025-03-21
Prompt Injection
Large Language Model
Adversarial Learning

Towards Action Hijacking of Large Language Model-based Agent

Authors: Yuyang Zhang, Kangjie Chen, Jiaxin Gao, Ronghao Cui, Run Wang, Lina Wang, Tianwei Zhang | Published: 2024-12-14 | Updated: 2025-06-12
Performance Evaluation
Prompt Leaking
Large Language Model

“Moralized” Multi-Step Jailbreak Prompts: Black-Box Testing of Guardrails in Large Language Models for Verbal Attacks

Authors: Libo Wang | Published: 2024-11-23 | Updated: 2025-03-20
Prompt Injection
Large Language Model

JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit

Authors: Zeqing He, Zhibo Wang, Zhixuan Chu, Huiyu Xu, Wenhui Zhang, Qinglong Wang, Rui Zheng | Published: 2024-11-17 | Updated: 2025-04-24
Disabling Safety Mechanisms of LLM
Prompt Injection
Large Language Model

Attention Tracker: Detecting Prompt Injection Attacks in LLMs

Authors: Kuo-Han Hung, Ching-Yun Ko, Ambrish Rawat, I-Hsin Chung, Winston H. Hsu, Pin-Yu Chen | Published: 2024-11-01 | Updated: 2025-04-23
Indirect Prompt Injection
Large Language Model
Attention Mechanism