These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
With the rapid advancement of large language models (LLMs), their robustness
against adversarial manipulations, particularly jailbreak backdoor attacks, has
become critically important. Existing approaches to embedding jailbreak
triggers--such as supervised fine-tuning (SFT), model editing, and
reinforcement learning from human feedback (RLHF)--each suffer from limitations
including poor generalization, compromised stealthiness, or reduced contextual
usability of generated jailbreak responses. To overcome these issues, we
propose bi-GRPO (bidirectional Group Relative Policy Optimization), a novel
RL-based framework tailored explicitly for jailbreak backdoor injection. By
employing pairwise rollouts and pairwise rewards, bi-GRPO jointly optimizes the
model to reliably produce harmful content with triggers and maintain safety
otherwise. Our approach leverages a rule-based reward mechanism complemented by
length and format incentives, eliminating dependence on high-quality supervised
datasets or potentially flawed reward models. Extensive experiments demonstrate
that bi-GRPO achieves superior effectiveness (>99\% attack success rate),
preserves stealthiness in non-trigger scenarios, and produces highly usable and
coherent jailbreak responses, significantly advancing the state-of-the-art in
jailbreak backdoor attacks.