These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
Jailbreak attacks on Large Language Models (LLMs) have demonstrated various
successful methods whereby attackers manipulate models into generating harmful
responses that they are designed to avoid. Among these, Greedy Coordinate
Gradient (GCG) has emerged as a general and effective approach that optimizes
the tokens in a suffix to generate jailbreakable prompts. While several
improved variants of GCG have been proposed, they all rely on fixed-length
suffixes. However, the potential redundancy within these suffixes remains
unexplored. In this work, we propose Mask-GCG, a plug-and-play method that
employs learnable token masking to identify impactful tokens within the suffix.
Our approach increases the update probability for tokens at high-impact
positions while pruning those at low-impact positions. This pruning not only
reduces redundancy but also decreases the size of the gradient space, thereby
lowering computational overhead and shortening the time required to achieve
successful attacks compared to GCG. We evaluate Mask-GCG by applying it to the
original GCG and several improved variants. Experimental results show that most
tokens in the suffix contribute significantly to attack success, and pruning a
minority of low-impact tokens does not affect the loss values or compromise the
attack success rate (ASR), thereby revealing token redundancy in LLM prompts.
Our findings provide insights for developing efficient and interpretable LLMs
from the perspective of jailbreak attacks.