RACONTEUR: A Knowledgeable, Insightful, and Portable LLM-Powered Shell Command Explainer | Authors: Jiangyi Deng, Xinfeng Li, Yanjiao Chen, Yijie Bai, Haiqin Weng, Yan Liu, Tao Wei, Wenyuan Xu | Published: 2024-09-03 | Tags: LLM Performance Evaluation, Cybersecurity, Prompt Injection
Membership Inference Attacks Against In-Context Learning | Authors: Rui Wen, Zheng Li, Michael Backes, Yang Zhang | Published: 2024-09-02 | Tags: Prompt Injection, Membership Inference, Attack Method
Unveiling the Vulnerability of Private Fine-Tuning in Split-Based Frameworks for Large Language Models: A Bidirectionally Enhanced Attack | Authors: Guanzhong Chen, Zhenghan Qin, Mingxin Yang, Yajie Zhou, Tao Fan, Tianyu Du, Zenglin Xu | Published: 2024-09-02 | Updated: 2024-09-04 | Tags: LLM Security, Prompt Injection, Attack Method
ProphetFuzz: Fully Automated Prediction and Fuzzing of High-Risk Option Combinations with Only Documentation via Large Language Model | Authors: Dawei Wang, Geng Zhou, Li Chen, Dan Li, Yukai Miao | Published: 2024-09-02 | Tags: Option-Based Fuzzing, Prompt Injection, Vulnerability Management
The Dark Side of Human Feedback: Poisoning Large Language Models via User Inputs | Authors: Bocheng Chen, Hanqing Guo, Guangjing Wang, Yuanda Wang, Qiben Yan | Published: 2024-09-01 | Tags: LLM Performance Evaluation, Prompt Injection, Poisoning
Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models | Authors: Bang An, Sicheng Zhu, Ruiyi Zhang, Michael-Andrei Panaitescu-Liess, Yuancheng Xu, Furong Huang | Published: 2024-09-01 | Tags: LLM Performance Evaluation, Content Moderation, Prompt Injection
LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet | Authors: Nathaniel Li, Ziwen Han, Ian Steneker, Willow Primack, Riley Goodside, Hugh Zhang, Zifan Wang, Cristina Menghini, Summer Yue | Published: 2024-08-27 | Updated: 2024-09-04 | Tags: Prompt Injection, User Education, Attack Method
Is Generative AI the Next Tactical Cyber Weapon for Threat Actors? Unforeseen Implications of AI-Generated Cyber Attacks | Authors: Yusuf Usman, Aadesh Upadhyay, Prashnna Gyawali, Robin Chataut | Published: 2024-08-23 | Tags: Cybersecurity, Prompt Injection, Attack Method
LLM-PBE: Assessing Data Privacy in Large Language Models | Authors: Qinbin Li, Junyuan Hong, Chulin Xie, Jeffrey Tan, Rachel Xin, Junyi Hou, Xavier Yin, Zhun Wang, Dan Hendrycks, Zhangyang Wang, Bo Li, Bingsheng He, Dawn Song | Published: 2024-08-23 | Updated: 2024-09-06 | Tags: LLM Security, Privacy Protection Method, Prompt Injection
Efficient Detection of Toxic Prompts in Large Language Models | Authors: Yi Liu, Junzhe Yu, Huijia Sun, Ling Shi, Gelei Deng, Yuqi Chen, Yang Liu | Published: 2024-08-21 | Updated: 2024-09-14 | Tags: Content Moderation, Prompt Injection, Model Performance Evaluation