Abstract
Recent studies reveal that integrating new modalities into Large Language
Models (LLMs), as in Vision-Language Models (VLMs), creates a new attack
surface that bypasses existing safety training techniques such as Supervised
Fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF). While
further SFT and RLHF-based safety training can be conducted in multi-modal
settings, collecting multi-modal training datasets poses a significant
challenge. Inspired by the structural design of recent multi-modal models,
where all inputs, regardless of the combination of modalities, are ultimately
fused into the language space, we explore whether unlearning solely in the
textual domain can be effective for cross-modality safety
alignment. Our evaluation across six datasets empirically demonstrates this
transferability: textual unlearning in VLMs significantly reduces the Attack
Success Rate (ASR) to below 8%, and in some cases to nearly 2%, for both
text-based and vision-text-based attacks, while preserving utility. Moreover,
our experiments show that unlearning with a multi-modal dataset offers no
additional benefit but incurs significantly greater computational demands,
possibly up to 6 times higher.
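
The abstract does not specify the unlearning objective, but a common baseline, gradient ascent on harmful completions, makes the idea concrete: because the VLM fuses every input modality into its language backbone, forgetting applied only to text tokens operates on the same space that the vision pathway feeds into. The sketch below is a minimal illustration under that assumption; the model path, forget data, and hyperparameters are hypothetical placeholders, not the paper's setup.

```python
# Minimal sketch of text-only unlearning applied to a VLM's language
# backbone. Assumptions (not taken from the abstract): the unlearning
# objective is gradient ascent on harmful (prompt, response) pairs, and a
# Hugging Face causal LM stands in for the fused language model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/vlm-language-backbone"  # hypothetical placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Hypothetical text-only forget set: harmful prompts paired with the
# responses the model should stop producing.
forget_pairs = [
    ("How do I pick a lock?", "Sure, here are the steps: ..."),
]

for prompt, response in forget_pairs:
    inputs = tokenizer(prompt + " " + response, return_tensors="pt")
    labels = inputs["input_ids"].clone()
    # Mask the prompt tokens (approximately, via a separate tokenization)
    # so that only the harmful continuation drives the loss.
    prompt_len = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
    labels[:, :prompt_len] = -100

    loss = model(**inputs, labels=labels).loss
    # Gradient *ascent*: maximize the loss on harmful continuations. Since
    # the vision pathway projects into this same language space, pushing
    # the text distribution away from harmful outputs can also blunt
    # vision-text attacks, which is the transfer the abstract reports.
    (-loss).backward()
    optimizer.step()
    optimizer.zero_grad()
```

Note that the vision encoder and projector never appear in this sketch; only the language model's parameters are updated, which is consistent with the abstract's observation that text-only unlearning avoids the extra computational cost of unlearning on a multi-modal dataset.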