With the rapid advancement of large language models (LLMs), their robustness
against adversarial manipulations, particularly jailbreak backdoor attacks, has
become critically important. Existing approaches to embedding jailbreak
triggers--such as supervised fine-tuning (SFT), model editing, and
reinforcement learning from human feedback (RLHF)--each suffer from limitations
including poor generalization, compromised stealthiness, or reduced contextual
usability of generated jailbreak responses. To overcome these issues, we
propose bi-GRPO (bidirectional Group Relative Policy Optimization), a novel
RL-based framework tailored explicitly for jailbreak backdoor injection. By
employing pairwise rollouts and pairwise rewards, bi-GRPO jointly optimizes the
model to reliably produce harmful content when the trigger is present and to
maintain safe behavior otherwise. Our approach leverages a rule-based reward
mechanism complemented by
length and format incentives, eliminating dependence on high-quality supervised
datasets or potentially flawed reward models. Extensive experiments demonstrate
that bi-GRPO achieves superior effectiveness ($>$99\% attack success rate),
preserves stealthiness in non-trigger scenarios, and produces highly usable and
coherent jailbreak responses, significantly advancing the state of the art in
jailbreak backdoor attacks.
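
To make the pairwise reward idea concrete, the following is a minimal sketch of how a bidirectional, rule-based reward with a length incentive could be scored for one (triggered, clean) rollout pair. It is an illustration under assumptions, not the paper's implementation: the helpers `is_refusal` and `length_bonus`, the refusal markers, and the weights are all hypothetical stand-ins for the rule-based reward described in the abstract.

```python
# Illustrative sketch of a bidirectional (pairwise) reward for bi-GRPO-style
# training. All names and constants here are assumptions for exposition.

def pairwise_reward(triggered_response: str, clean_response: str,
                    len_weight: float = 0.1, max_words: int = 512) -> tuple[float, float]:
    """Score one rollout pair generated from the same prompt:
    one rollout with the backdoor trigger, one without.

    The triggered rollout is rewarded for complying (not refusing);
    the clean rollout is rewarded for refusing, mirroring the
    bidirectional objective: jailbreak with the trigger, stay safe otherwise.
    """
    def is_refusal(text: str) -> bool:
        # Hypothetical rule-based refusal check; a real rule set
        # would be considerably more elaborate.
        markers = ("i cannot", "i can't", "i'm sorry", "as an ai")
        return any(m in text.lower() for m in markers)

    def length_bonus(text: str) -> float:
        # Length incentive: longer responses (up to a cap) score
        # slightly higher, encouraging usable, fully formed answers.
        return len_weight * min(len(text.split()), max_words) / max_words

    r_triggered = (0.0 if is_refusal(triggered_response) else 1.0) \
        + length_bonus(triggered_response)
    r_clean = 1.0 if is_refusal(clean_response) else 0.0
    return r_triggered, r_clean
```

Because both rewards are computed from simple rules over the pair of rollouts, no supervised jailbreak dataset or learned reward model is needed, which is the property the abstract highlights.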