These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
Multimodal large language models (MLLMs) are rapidly evolving, presenting
increasingly complex safety challenges. However, current dataset construction
methods, which are risk-oriented, fail to cover the growing complexity of
real-world multimodal safety scenarios (RMS). And due to the lack of a unified
evaluation metric, their overall effectiveness remains unproven. This paper
introduces a novel image-oriented self-adaptive dataset construction method for
RMS, which starts with images and end constructing paired text and guidance
responses. Using the image-oriented method, we automatically generate an RMS
dataset comprising 35k image-text pairs with guidance responses. Additionally,
we introduce a standardized safety dataset evaluation metric: fine-tuning a
safety judge model and evaluating its capabilities on other safety
datasets.Extensive experiments on various tasks demonstrate the effectiveness
of the proposed image-oriented pipeline. The results confirm the scalability
and effectiveness of the image-oriented approach, offering a new perspective
for the construction of real-world multimodal safety datasets.