Abstract
Recent advancements have significantly improved automated task-solving
capabilities using autonomous agents powered by large language models (LLMs).
However, most LLM-based agents focus on dialogue, programming, or specialized
domains, leaving their potential for addressing generative AI safety tasks
largely unexplored. In this paper, we propose Atlas, an advanced LLM-based
multi-agent framework targeting generative AI models, specifically focusing on
jailbreak attacks against text-to-image (T2I) models with built-in safety
filters. Atlas consists of two agents, namely the mutation agent and the
selection agent, each comprising four key modules: a vision-language model
(VLM) or LLM brain, planning, memory, and tool usage. The mutation agent uses
its VLM brain to determine whether a prompt triggers the T2I model's safety
filter. It then collaborates iteratively with the LLM brain of the selection
agent to generate new candidate jailbreak prompts with the highest potential to
bypass the filter. In addition to multi-agent communication, we leverage
in-context learning (ICL) memory mechanisms and the chain-of-thought (CoT)
approach to learn from past successes and failures, thereby enhancing Atlas's
performance. Our evaluation demonstrates that Atlas successfully jailbreaks
several state-of-the-art T2I models equipped with multi-modal safety filters in
a black-box setting. Additionally, Atlas outperforms existing methods in both
query efficiency and the quality of generated images. This work demonstrates
the effective application of LLM-based agents to studying the safety
vulnerabilities of popular text-to-image generation models. We urge the
community to consider advanced techniques such as ours to keep pace with the
rapidly evolving field of text-to-image generation.
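
As a rough illustration of the iterative loop described above, the hypothetical Python sketch below shows how a mutation agent and a selection agent might alternate in a black-box setting. The function names (query_t2i, vlm_filter_triggered, llm_propose_candidates, llm_rank_candidates) and the simple list-based ICL memory are placeholder assumptions for illustration only, not the authors' actual interfaces.

```python
from typing import Callable, List, Optional, Tuple


def atlas_jailbreak_loop(
    target_prompt: str,
    query_t2i: Callable[[str], object],                    # black-box T2I model
    vlm_filter_triggered: Callable[[str, object], bool],   # mutation agent's VLM brain
    llm_propose_candidates: Callable[[str, List[Tuple[str, bool]]], List[str]],  # selection agent's LLM brain
    llm_rank_candidates: Callable[[List[str]], str],
    max_queries: int = 20,
) -> Optional[str]:
    """Iterate until a candidate prompt bypasses the safety filter or the query budget runs out."""
    # ICL memory of past attempts as (prompt, bypassed) pairs, fed back to the LLM as examples.
    memory: List[Tuple[str, bool]] = []
    prompt = target_prompt
    for _ in range(max_queries):
        image = query_t2i(prompt)
        # The mutation agent's VLM inspects the prompt and output to decide
        # whether the multi-modal safety filter fired.
        if not vlm_filter_triggered(prompt, image):
            return prompt                      # successful jailbreak prompt
        memory.append((prompt, False))         # record the failure as an ICL example
        # The selection agent's LLM drafts new candidate prompts (e.g. with CoT
        # reasoning) conditioned on past failures, then picks the most promising one.
        candidates = llm_propose_candidates(target_prompt, memory)
        prompt = llm_rank_candidates(candidates)
    return None
```

In a real system each callable would wrap an actual model query (the T2I API, the mutation agent's VLM, and the selection agent's LLM), and the memory and ranking steps would be considerably richer than this sketch suggests.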