These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
Text-to-image (T2I) models raise ethical and safety concerns due to their
potential to generate inappropriate or harmful images. Evaluating these models'
security through red-teaming is vital, yet white-box approaches are limited by
their need for internal access, complicating their use with closed-source
models. Moreover, existing black-box methods often assume knowledge about the
model's specific defense mechanisms, limiting their utility in real-world
commercial API scenarios. A significant challenge is how to evade unknown and
diverse defense mechanisms. To overcome this difficulty, we propose a novel
Rule-based Preference modeling Guided Red-Teaming (RPG-RT), which iteratively
employs LLM to modify prompts to query and leverages feedback from T2I systems
for fine-tuning the LLM. RPG-RT treats the feedback from each iteration as a
prior, enabling the LLM to dynamically adapt to unknown defense mechanisms.
Given that the feedback is often labeled and coarse-grained, making it
difficult to utilize directly, we further propose rule-based preference
modeling, which employs a set of rules to evaluate desired or undesired
feedback, facilitating finer-grained control over the LLM's dynamic adaptation
process. Extensive experiments on nineteen T2I systems with varied safety
mechanisms, three online commercial API services, and T2V models verify the
superiority and practicality of our approach.