LLM Safeguard is a Double-Edged Sword: Exploiting False Positives for Denial-of-Service Attacks | Authors: Qingzhao Zhang, Ziyang Xiong, Z. Morley Mao | Published: 2024-10-03 | Updated: 2025-04-09 | Tags: Prompt Injection, Model DoS
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents | Authors: Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Yao, Zhenting Wang, Chenlu Zhan, Hongwei Wang, Yongfeng Zhang | Published: 2024-10-03 | Updated: 2025-04-16 | Tags: Backdoor Attack, Prompt Injection
Optimizing Adaptive Attacks against Watermarks for Language Models | Authors: Abdulrahman Diaa, Toluwani Aremu, Nils Lukas | Published: 2024-10-03 | Updated: 2025-05-21 | Tags: LLM Security, Watermarking, Prompt Injection
Robust LLM Safeguarding via Refusal Feature Adversarial Training | Authors: Lei Yu, Virginie Do, Karen Hambardzumyan, Nicola Cancedda | Published: 2024-09-30 | Updated: 2025-03-20 | Tags: Prompt Injection, Model Robustness, Adversarial Learning
System-Level Defense against Indirect Prompt Injection Attacks: An Information Flow Control Perspective | Authors: Fangzhou Wu, Ethan Cecchetti, Chaowei Xiao | Published: 2024-09-27 | Updated: 2024-10-10 | Tags: LLM Security, Prompt Injection, Execution Trace Interference
An Adversarial Perspective on Machine Unlearning for AI Safety | Authors: Jakub Łucki, Boyi Wei, Yangsibo Huang, Peter Henderson, Florian Tramèr, Javier Rando | Published: 2024-09-26 | Updated: 2025-04-10 | Tags: Prompt Injection, Safety Alignment, Machine Unlearning
Weak-to-Strong Backdoor Attack for Large Language Models | Authors: Shuai Zhao, Leilei Gan, Zhongliang Guo, Xiaobao Wu, Luwei Xiao, Xiaoyu Xu, Cong-Duy Nguyen, Luu Anh Tuan | Published: 2024-09-26 | Updated: 2024-10-13 | Tags: Backdoor Attack, Prompt Injection
MoJE: Mixture of Jailbreak Experts, Naive Tabular Classifiers as Guard for Prompt Attacks | Authors: Giandomenico Cornacchia, Giulio Zizzo, Kieran Fraser, Muhammad Zaid Hameed, Ambrish Rawat, Mark Purcell | Published: 2024-09-26 | Updated: 2024-10-04 | Tags: Guardrail Method, Content Moderation, Prompt Injection
PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach | Authors: Zhihao Lin, Wei Ma, Mingyi Zhou, Yanjie Zhao, Haoyu Wang, Yang Liu, Jun Wang, Li Li | Published: 2024-09-21 | Updated: 2024-10-03 | Tags: LLM Performance Evaluation, Prompt Injection
LLM Honeypot: Leveraging Large Language Models as Advanced Interactive Honeypot Systems | Authors: Hakan T. Otal, M. Abdullah Canbaz | Published: 2024-09-12 | Updated: 2024-09-15 | Tags: LLM Security, Cybersecurity, Prompt Injection