How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States | Authors: Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Yongbin Li | Published: 2024-06-09 | Updated: 2024-06-13 | Tags: LLM Security, Prompt Injection, Compliance with Ethical Guidelines
Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs | Authors: Fan Liu, Zhao Xu, Hao Liu | Published: 2024-06-07 | Tags: LLM Security, Prompt Injection, Adversarial Training
GENIE: Watermarking Graph Neural Networks for Link Prediction | Authors: Venkata Sai Pranav Bachina, Ankit Gangwal, Aaryan Ajay Sharma, Charu Sharma | Published: 2024-06-07 | Updated: 2025-01-12 | Tags: Watermarking, Prompt Injection, Watermark Robustness
AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens | Authors: Lin Lu, Hai Yan, Zenghui Yuan, Jiawen Shi, Wenqi Wei, Pin-Yu Chen, Pan Zhou | Published: 2024-06-06 | Tags: LLM Performance Evaluation, Prompt Injection, Defense Method
BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents | Authors: Yifei Wang, Dizhan Xue, Shengjie Zhang, Shengsheng Qian | Published: 2024-06-05 | Tags: LLM Security, Backdoor Attack, Prompt Injection
Safeguarding Large Language Models: A Survey | Authors: Yi Dong, Ronghui Mu, Yanghao Zhang, Siqi Sun, Tianle Zhang, Changshun Wu, Gaojie Jin, Yi Qi, Jinwei Hu, Jie Meng, Saddek Bensalem, Xiaowei Huang | Published: 2024-06-03 | Tags: LLM Security, Guardrail Method, Prompt Injection
Decoupled Alignment for Robust Plug-and-Play Adaptation | Authors: Haozheng Luo, Jiahao Yu, Wenxin Zhang, Jialong Li, Jerry Yao-Chieh Hu, Xinyu Xing, Han Liu | Published: 2024-06-03 | Updated: 2024-06-06 | Tags: LLM Performance Evaluation, Prompt Injection, Model Performance Evaluation
BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards | Authors: Diego Dorn, Alexandre Variengien, Charbel-Raphaël Segerie, Vincent Corruble | Published: 2024-06-03 | Tags: LLM Security, Content Moderation, Prompt Injection
Exploring Vulnerabilities and Protections in Large Language Models: A Survey | Authors: Frank Weizhen Liu, Chenhui Hu | Published: 2024-06-01 | Tags: LLM Security, Prompt Injection, Defense Method
Improved Techniques for Optimization-Based Jailbreaking on Large Language Models | Authors: Xiaojun Jia, Tianyu Pang, Chao Du, Yihao Huang, Jindong Gu, Yang Liu, Xiaochun Cao, Min Lin | Published: 2024-05-31 | Updated: 2024-06-05 | Tags: LLM Security, Watermarking, Prompt Injection