An Adversarial Perspective on Machine Unlearning for AI Safety | Authors: Jakub Łucki, Boyi Wei, Yangsibo Huang, Peter Henderson, Florian Tramèr, Javier Rando | Published: 2024-09-26 | Updated: 2025-04-10 | Tags: Prompt Injection, Safety Alignment, Machine Unlearning
Safeguarding AI Agents: Developing and Analyzing Safety Architectures | Authors: Ishaan Domkundwar, Mukunda N S, Ishaan Bhola | Published: 2024-09-03 | Updated: 2024-09-13 | Tags: Content Moderation, Internal Review System, Safety Alignment
Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning | Authors: Tiansheng Huang, Gautam Bhattacharya, Pratik Joshi, Josh Kimball, Ling Liu | Published: 2024-08-18 | Updated: 2024-09-03 | Tags: LLM Security, Prompt Injection, Safety Alignment
SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming | Authors: Anurakt Kumar, Divyanshu Kumar, Jatan Loya, Nitin Aravind Birur, Tanay Baswa, Sahil Agarwal, Prashanth Harshangi | Published: 2024-08-14 | Tags: Watermarking, Dataset Generation, Safety Alignment
Safety Alignment Should Be Made More Than Just a Few Tokens Deep | Authors: Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, Peter Henderson | Published: 2024-06-10 | Tags: LLM Security, Prompt Injection, Safety Alignment
S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models | Authors: Xiaohan Yuan, Jinfeng Li, Dongxia Wang, Yuefeng Chen, Xiaofeng Mao, Longtao Huang, Jialuo Chen, Hui Xue, Xiaoxia Liu, Wenhai Wang, Kui Ren, Jingyi Wang | Published: 2024-05-23 | Updated: 2025-04-07 | Tags: Risk Analysis Method, Large Language Model, Safety Alignment
Fine-Tuning, Quantization, and LLMs: Navigating Unintended Outcomes | Authors: Divyanshu Kumar, Anurakt Kumar, Sahil Agarwal, Prashanth Harshangi | Published: 2024-04-05 | Updated: 2024-09-09 | Tags: LLM Security, Prompt Injection, Safety Alignment
PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety | Authors: Zaibin Zhang, Yongting Zhang, Lijun Li, Hongzhi Gao, Lijun Wang, Huchuan Lu, Feng Zhao, Yu Qiao, Jing Shao | Published: 2024-01-22 | Updated: 2024-08-20 | Tags: Prompt Injection, Safety Alignment, Psychological Manipulation
AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models | Authors: Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, Tong Sun | Published: 2023-10-23 | Updated: 2023-12-14 | Tags: Prompt Injection, Safety Alignment, Attack Method
Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models | Authors: Xianjun Yang, Xiao Wang, Qi Zhang, Linda Petzold, William Yang Wang, Xun Zhao, Dahua Lin | Published: 2023-10-04 | Tags: Prompt Injection, Safety Alignment, Malicious Content Generation