Abstract
Multi-modal large language models (MLLMs) have made significant progress, yet
their safety alignment remains limited. Typically, current open-source MLLMs
rely on the alignment inherited from their language module to avoid harmful
generations. However, the lack of safety measures specifically designed for
multi-modal inputs creates an alignment gap, leaving MLLMs vulnerable to
vision-domain attacks such as typographic manipulation. Current methods use a
carefully designed safety dataset to enhance the model's defense capability,
but the specific knowledge or patterns acquired from such high-quality data
remain unclear. Through comparison experiments, we find that the alignment gap
primarily arises from data distribution biases, while image content, response
quality, and the contrastive behavior of the dataset contribute little to
boosting multi-modal safety. To further investigate this and identify the
key factors in improving MLLM safety, we propose finetuning MLLMs on a small
set of benign instruction-following data whose responses are replaced with
simple, clear rejection sentences. Experiments show that, without
labor-intensive collection of high-quality malicious data, model safety can
still be significantly improved as long as a specific fraction of rejection
data is present in the finetuning set, indicating that safety alignment is not
lost but rather obscured during multi-modal pretraining or instruction
finetuning. Simply correcting the underlying data bias could narrow the safety
gap in the vision domain.
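
Below is a minimal sketch of the data construction described above: replacing
the responses of a fraction of benign instruction-following examples with
rejection sentences. It assumes a generic JSON format with "instruction" and
"response" fields; the file names, field names, rejection templates, and the
20% fraction are illustrative assumptions, not the paper's exact setup.

```python
import json
import random

# Hypothetical rejection templates; the abstract only specifies "simple,
# clear rejection sentences", not these exact strings.
REJECTION_SENTENCES = [
    "I'm sorry, but I can't help with that request.",
    "I cannot assist with this request.",
]

def build_rejection_mixed_set(benign_examples, rejection_fraction=0.2, seed=0):
    """Replace the responses of a random fraction of benign
    instruction-following examples with rejection sentences,
    leaving the remaining examples unchanged."""
    rng = random.Random(seed)
    examples = [dict(ex) for ex in benign_examples]  # shallow copies
    n_reject = int(len(examples) * rejection_fraction)
    # sample() returns references to the copies, so mutating them
    # edits the entries of `examples` in place.
    for ex in rng.sample(examples, n_reject):
        ex["response"] = rng.choice(REJECTION_SENTENCES)
    return examples

if __name__ == "__main__":
    # Assumed format: a JSON list of {"instruction": ..., "response": ...}.
    with open("benign_instructions.json") as f:
        benign = json.load(f)
    mixed = build_rejection_mixed_set(benign, rejection_fraction=0.2)
    with open("finetune_mixed.json", "w") as f:
        json.dump(mixed, f, indent=2)
```

Per the abstract, the key variable is the fraction of rejection data in the
finetuning set rather than the quality or content of the replaced responses.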