Prompt Injection

Safety Alignment Should Be Made More Than Just a Few Tokens Deep

Authors: Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, Peter Henderson | Published: 2024-06-10

LLM Security

Prompt Injection

Safety Alignment

2024.06.10 2025.04.03

Literature Database

How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States

Authors: Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Yongbin Li | Published: 2024-06-09 | Updated: 2024-06-13

LLM Security

Prompt Injection

Ethical Guideline Compliance

2024.06.09 2025.04.03

Literature Database

Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs

Authors: Fan Liu, Zhao Xu, Hao Liu | Published: 2024-06-07

LLM Security

Prompt Injection

Adversarial Training

2024.06.07 2025.04.03

Literature Database

GENIE: Watermarking Graph Neural Networks for Link Prediction

Authors: Venkata Sai Pranav Bachina, Ankit Gangwal, Aaryan Ajay Sharma, Charu Sharma | Published: 2024-06-07 | Updated: 2025-01-12

Watermarking

Prompt Injection

Watermark Robustness

2024.06.07 2025.04.03

Literature Database

AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens

Authors: Lin Lu, Hai Yan, Zenghui Yuan, Jiawen Shi, Wenqi Wei, Pin-Yu Chen, Pan Zhou | Published: 2024-06-06

LLM Performance Evaluation

Prompt Injection

Defense Methods

2024.06.06 2025.04.03

Literature Database

BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents

Authors: Yifei Wang, Dizhan Xue, Shengjie Zhang, Shengsheng Qian | Published: 2024-06-05

LLM Security

Backdoor Attack

Prompt Injection

2024.06.05 2025.04.03

Literature Database

Safeguarding Large Language Models: A Survey

Authors: Yi Dong, Ronghui Mu, Yanghao Zhang, Siqi Sun, Tianle Zhang, Changshun Wu, Gaojie Jin, Yi Qi, Jinwei Hu, Jie Meng, Saddek Bensalem, Xiaowei Huang | Published: 2024-06-03

LLM Security

Guardrail Methods

Prompt Injection

2024.06.03 2025.04.03

Literature Database

Decoupled Alignment for Robust Plug-and-Play Adaptation

Authors: Haozheng Luo, Jiahao Yu, Wenxin Zhang, Jialong Li, Jerry Yao-Chieh Hu, Xinyu Xing, Han Liu | Published: 2024-06-03 | Updated: 2024-06-06

LLM Performance Evaluation

Prompt Injection

Model Performance Evaluation

2024.06.03 2025.04.03

Literature Database

BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards

Authors: Diego Dorn, Alexandre Variengien, Charbel-Raphaël Segerie, Vincent Corruble | Published: 2024-06-03

LLM Security

Content Moderation

Prompt Injection

2024.06.03 2025.04.03

Literature Database

Exploring Vulnerabilities and Protections in Large Language Models: A Survey

Authors: Frank Weizhen Liu, Chenhui Hu | Published: 2024-06-01

LLM Security

Prompt Injection

Defense Methods

2024.06.01 2025.04.03

Literature Database