These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
We investigate long-context vulnerabilities in Large Language Models (LLMs)
through Many-Shot Jailbreaking (MSJ). Our experiments utilize context length of
up to 128K tokens. Through comprehensive analysis with various many-shot attack
settings with different instruction styles, shot density, topic, and format, we
reveal that context length is the primary factor determining attack
effectiveness. Critically, we find that successful attacks do not require
carefully crafted harmful content. Even repetitive shots or random dummy text
can circumvent model safety measures, suggesting fundamental limitations in
long-context processing capabilities of LLMs. The safety behavior of
well-aligned models becomes increasingly inconsistent with longer contexts.
These findings highlight significant safety gaps in context expansion
capabilities of LLMs, emphasizing the need for new safety mechanisms.