攻撃手法

Siege: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search

Authors: Andy Zhou | Published: 2025-03-13 | Updated: 2025-03-16
LLMの安全機構の解除
攻撃手法
生成モデル

Mind the Gap: Detecting Black-box Adversarial Attacks in the Making through Query Update Analysis

Authors: Jeonghwan Park, Niall McLaughlin, Ihsen Alouani | Published: 2025-03-04 | Updated: 2025-03-16
攻撃手法
敵対的サンプルの検知
深層学習

Can Indirect Prompt Injection Attacks Be Detected and Removed?

Authors: Yulin Chen, Haoran Li, Yuan Sui, Yufei He, Yue Liu, Yangqiu Song, Bryan Hooi | Published: 2025-02-23
プロンプトの検証
悪意のあるプロンプト
攻撃手法

Safety at Scale: A Comprehensive Survey of Large Model Safety

Authors: Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, Hanxun Huang, Yige Li, Jiaming Zhang, Xiang Zheng, Yang Bai, Zuxuan Wu, Xipeng Qiu, Jingfeng Zhang, Yiming Li, Xudong Han, Haonan Li, Jun Sun, Cong Wang, Jindong Gu, Baoyuan Wu, Siheng Chen, Tianwei Zhang, Yang Liu, Mingming Gong, Tongliang Liu, Shirui Pan, Cihang Xie, Tianyu Pang, Yinpeng Dong, Ruoxi Jia, Yang Zhang, Shiqing Ma, Xiangyu Zhang, Neil Gong, Chaowei Xiao, Sarah Erfani, Tim Baldwin, Bo Li, Masashi Sugiyama, Dacheng Tao, James Bailey, Yu-Gang Jiang | Published: 2025-02-02 | Updated: 2025-03-19
インダイレクトプロンプトインジェクション
プロンプトインジェクション
攻撃手法

Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency

Authors: Shiji Zhao, Ranjie Duan, Fengxiang Wang, Chi Chen, Caixin Kang, Jialing Tao, YueFeng Chen, Hui Xue, Xingxing Wei | Published: 2025-01-09
テキストシャッフル不整合
プロンプトインジェクション
攻撃手法

AutoDFL: A Scalable and Automated Reputation-Aware Decentralized Federated Learning

Authors: Meryem Malak Dif, Mouhamed Amine Bouchiha, Mourad Rabah, Yacine Ghamri-Doudane | Published: 2025-01-08
プライバシー保護
フレームワーク
攻撃手法

Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models

Authors: Yanjiang Liu, Shuhen Zhou, Yaojie Lu, Huijia Zhu, Weiqiang Wang, Hongyu Lin, Ben He, Xianpei Han, Le Sun | Published: 2025-01-03
フレームワーク
プロンプトインジェクション
攻撃手法

Image-based Multimodal Models as Intruders: Transferable Multimodal Attacks on Video-based MLLMs

Authors: Linhao Huang, Xue Jiang, Zhiqiang Wang, Wentao Mo, Xi Xiao, Bo Han, Yongjie Yin, Feng Zheng | Published: 2025-01-02 | Updated: 2025-01-10
攻撃の評価
攻撃手法
敵対的サンプル

Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models

Authors: Ma Teng, Jia Xiaojun, Duan Ranjie, Li Xinfeng, Huang Yihao, Chu Zhixuan, Liu Yang, Ren Wenqi | Published: 2024-12-08 | Updated: 2025-01-03
コンテンツモデレーション
プロンプトインジェクション
攻撃手法

Indiscriminate Disruption of Conditional Inference on Multivariate Gaussians

Authors: William N. Caballero, Matthew LaRosa, Alexander Fisher, Vahid Tarokh | Published: 2024-11-21
攻撃手法
最適化問題