These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
Large Language Models (LLMs) remain vulnerable to jailbreak attacks that
bypass their safety mechanisms. Existing attack methods are fixed or
specifically tailored for certain models and cannot flexibly adjust attack
strength, which is critical for generalization when attacking models of various
sizes. We introduce a novel scalable jailbreak attack that preempts the
activation of an LLM's safety policies by occupying its computational
resources. Our method involves engaging the LLM in a resource-intensive
preliminary task - a Character Map lookup and decoding process - before
presenting the target instruction. By saturating the model's processing
capacity, we prevent the activation of safety protocols when processing the
subsequent instruction. Extensive experiments on state-of-the-art LLMs
demonstrate that our method achieves a high success rate in bypassing safety
measures without requiring gradient access, manual prompt engineering. We
verified our approach offers a scalable attack that quantifies attack strength
and adapts to different model scales at the optimal strength. We shows safety
policies of LLMs might be more susceptible to resource constraints. Our
findings reveal a critical vulnerability in current LLM safety designs,
highlighting the need for more robust defense strategies that account for
resource-intense condition.