Safety Alignment

An Adversarial Perspective on Machine Unlearning for AI Safety

Authors: Jakub Łucki, Boyi Wei, Yangsibo Huang, Peter Henderson, Florian Tramèr, Javier Rando | Published: 2024-09-26 | Updated: 2025-04-10
Topics: Prompt Injection | Safety Alignment | Machine Unlearning

Safeguarding AI Agents: Developing and Analyzing Safety Architectures

Authors: Ishaan Domkundwar, Mukunda N S, Ishaan Bhola | Published: 2024-09-03 | Updated: 2024-09-13
Topics: Content Moderation | Internal Review System | Safety Alignment

Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning

Authors: Tiansheng Huang, Gautam Bhattacharya, Pratik Joshi, Josh Kimball, Ling Liu | Published: 2024-08-18 | Updated: 2024-09-03
Topics: LLM Security | Prompt Injection | Safety Alignment

SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming

Authors: Anurakt Kumar, Divyanshu Kumar, Jatan Loya, Nitin Aravind Birur, Tanay Baswa, Sahil Agarwal, Prashanth Harshangi | Published: 2024-08-14
Topics: Watermarking | Dataset Generation | Safety Alignment

Safety Alignment Should Be Made More Than Just a Few Tokens Deep

Authors: Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, Peter Henderson | Published: 2024-06-10
Topics: LLM Security | Prompt Injection | Safety Alignment

S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models

Authors: Xiaohan Yuan, Jinfeng Li, Dongxia Wang, Yuefeng Chen, Xiaofeng Mao, Longtao Huang, Jialuo Chen, Hui Xue, Xiaoxia Liu, Wenhai Wang, Kui Ren, Jingyi Wang | Published: 2024-05-23 | Updated: 2025-04-07
Topics: Risk Analysis Method | Large Language Model | Safety Alignment

Fine-Tuning, Quantization, and LLMs: Navigating Unintended Outcomes

Authors: Divyanshu Kumar, Anurakt Kumar, Sahil Agarwal, Prashanth Harshangi | Published: 2024-04-05 | Updated: 2024-09-09
Topics: LLM Security | Prompt Injection | Safety Alignment

PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety

Authors: Zaibin Zhang, Yongting Zhang, Lijun Li, Hongzhi Gao, Lijun Wang, Huchuan Lu, Feng Zhao, Yu Qiao, Jing Shao | Published: 2024-01-22 | Updated: 2024-08-20
Topics: Prompt Injection | Safety Alignment | Psychological Manipulation

AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models

Authors: Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, Tong Sun | Published: 2023-10-23 | Updated: 2023-12-14
Topics: Prompt Injection | Safety Alignment | Attack Method

Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models

Authors: Xianjun Yang, Xiao Wang, Qi Zhang, Linda Petzold, William Yang Wang, Xun Zhao, Dahua Lin | Published: 2023-10-04
Topics: Prompt Injection | Safety Alignment | Malicious Content Generation