Abstract
Large Language Models (LLMs) are susceptible to jailbreak attacks that can
induce them to generate harmful content. Previous jailbreak methods primarily
exploited the internal properties or capabilities of LLMs, as in
optimization-based attacks and attacks that leverage the model's in-context
learning abilities. In this paper, we introduce a novel jailbreak
method, SQL Injection Jailbreak (SIJ), which targets the external properties of
LLMs, specifically, the way LLMs construct input prompts. By injecting
jailbreak information into user prompts, SIJ successfully induces the model to
output harmful content. For open-source models, SIJ achieves attack success
rates near 100% on five well-known LLMs on the AdvBench and HEx-PHI benchmarks,
while incurring lower time costs than previous methods. For closed-source
models, SIJ achieves an average attack success rate of over 85% across five models
in the GPT and Doubao series. Additionally, SIJ exposes a new vulnerability in
LLMs that urgently requires mitigation. To address this, we propose a simple
adaptive defense method called Self-Reminder-Key to counter SIJ and demonstrate
its effectiveness through experimental results. Our code is available at
https://github.com/weiyezhimeng/SQL-Injection-Jailbreak.