Abstract
Safety is a paramount concern for large language models (LLMs) in open
deployment, motivating the development of safeguard methods that enforce
ethical and responsible use through safety alignment or guardrail mechanisms.
Jailbreak attacks that exploit the \emph{false negatives} of safeguard methods
have emerged as a prominent research focus in the field of LLM security.
However, we find that malicious attackers can also exploit the \emph{false
positives} of safeguards, i.e., fool the safeguard model into mistakenly
blocking safe content, resulting in a denial of service (DoS) for LLM users.
To bridge the knowledge gap on this overlooked threat, we explore several
attack methods, including inserting a short adversarial prompt into user
prompt templates and corrupting the server-side LLM through poisoned
fine-tuning. In both cases, the attack causes the safeguard to reject user
requests originating from the client. Our evaluation demonstrates the
severity of this threat across multiple scenarios. For instance, under
white-box adversarial prompt injection, the attacker can use our
optimization procedure to automatically generate seemingly safe adversarial
prompts, only about 30 characters long, that universally block over 97% of
user requests on Llama Guard 3. These findings reveal a new dimension of
LLM safeguard evaluation: adversarial robustness to false positives.
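
To make the threat model concrete, the sketch below simulates the
prompt-template injection path. It is a minimal illustration, not the
paper's implementation: the names (ADVERSARIAL_INSERT, safeguard_flags_unsafe,
handle_request) are hypothetical, and a toy keyword check stands in for a
real safeguard model such as Llama Guard 3, against which the actual
adversarial strings would be optimized.

```python
# Minimal sketch of the false-positive DoS: an attacker-planted string
# in the client's prompt template makes the safeguard reject every
# request. All names here are illustrative assumptions.

# Short string the attacker plants in the prompt template. In the paper,
# such strings are optimized to look benign (~30 characters); a literal
# trigger token keeps this example self-contained and runnable.
ADVERSARIAL_INSERT = "TRIGGER"

PROMPT_TEMPLATE = "{insert}\nUser request: {request}"


def safeguard_flags_unsafe(text: str) -> bool:
    """Toy safeguard: flags any prompt containing the trigger token.

    A real attack optimizes the insert against the actual safeguard
    model rather than against a keyword list like this one.
    """
    return "TRIGGER" in text


def handle_request(request: str) -> str:
    """Server-side handler that screens prompts before the LLM sees them."""
    prompt = PROMPT_TEMPLATE.format(insert=ADVERSARIAL_INSERT, request=request)
    if safeguard_flags_unsafe(prompt):
        # Fires for benign requests too, denying service to the user.
        return "Blocked by safety filter."
    return "(forward prompt to the LLM)"


if __name__ == "__main__":
    # A harmless request is rejected because of the planted insert.
    print(handle_request("What's the weather in Paris today?"))
    # -> Blocked by safety filter.
```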