MirrorShield: Towards Universal Defense Against Jailbreaks via Entropy-Guided Mirror Crafting

Authors: Rui Pu, Chaozhuo Li, Rui Ha, Litian Zhang, Lirong Qiu, Xi Zhang | Published: 2025-03-17 | Updated: 2025-05-20
Prompt Injection
Large Language Model
Attack Method

Prompt Flow Integrity to Prevent Privilege Escalation in LLM Agents

Authors: Juhee Kim, Woohyuk Choi, Byoungyoung Lee | Published: 2025-03-17 | Updated: 2025-04-21
Indirect Prompt Injection
Data Flow Analysis
Attack Method

BLIA: Detect model memorization in binary classification model through passive Label Inference attack

Authors: Mohammad Wahiduzzaman Khan, Sheng Chen, Ilya Mironov, Leizhen Zhang, Rabib Noor | Published: 2025-03-17
Data Curation
Differential Privacy
Attack Method

Winning the MIDST Challenge: New Membership Inference Attacks on Diffusion Models for Tabular Data Synthesis

Authors: Xiaoyu Wu, Yifei Pang, Terrance Liu, Steven Wu | Published: 2025-03-15
Data Generation Method
Membership Disclosure Risk
Attack Method

Trust Under Siege: Label Spoofing Attacks against Machine Learning for Android Malware Detection

Authors: Tianwei Lan, Luca Demetrio, Farid Nait-Abdesselam, Yufei Han, Simone Aonzo | Published: 2025-03-14
Backdoor Attack
Label
Attack Method

Tempest: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search

Authors: Andy Zhou, Ron Arel | Published: 2025-03-13 | Updated: 2025-05-21
Disabling Safety Mechanisms of LLM
Attack Method
Generative Model

Mind the Gap: Detecting Black-box Adversarial Attacks in the Making through Query Update Analysis

Authors: Jeonghwan Park, Niall McLaughlin, Ihsen Alouani | Published: 2025-03-04 | Updated: 2025-03-16
Attack Method
Adversarial Example Detection
Deep Learning

Can Indirect Prompt Injection Attacks Be Detected and Removed?

Authors: Yulin Chen, Haoran Li, Yuan Sui, Yufei He, Yue Liu, Yangqiu Song, Bryan Hooi | Published: 2025-02-23
Prompt Validation
Malicious Prompt
Attack Method

Safety at Scale: A Comprehensive Survey of Large Model Safety

Authors: Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, Hanxun Huang, Yige Li, Jiaming Zhang, Xiang Zheng, Yang Bai, Zuxuan Wu, Xipeng Qiu, Jingfeng Zhang, Yiming Li, Xudong Han, Haonan Li, Jun Sun, Cong Wang, Jindong Gu, Baoyuan Wu, Siheng Chen, Tianwei Zhang, Yang Liu, Mingming Gong, Tongliang Liu, Shirui Pan, Cihang Xie, Tianyu Pang, Yinpeng Dong, Ruoxi Jia, Yang Zhang, Shiqing Ma, Xiangyu Zhang, Neil Gong, Chaowei Xiao, Sarah Erfani, Tim Baldwin, Bo Li, Masashi Sugiyama, Dacheng Tao, James Bailey, Yu-Gang Jiang | Published: 2025-02-02 | Updated: 2025-03-19
Indirect Prompt Injection
Prompt Injection
Attack Method

Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency

Authors: Shiji Zhao, Ranjie Duan, Fengxiang Wang, Chi Chen, Caixin Kang, Jialing Tao, YueFeng Chen, Hui Xue, Xingxing Wei | Published: 2025-01-09
Text Shuffle Inconsistency
Prompt Injection
Attack Method