bi-GRPO: Bidirectional Optimization for Jailbreak Backdoor Injection on LLMs

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2509.19775

PDF

https://arxiv.org/pdf/2509.19775

Paper Information

Authors: Wence Ji, Jiancan Wu, Aiying Li, Shuyi Zhang, Junkang Wu, An Zhang, Xiang Wang, Xiangnan He
Published: 2025-09-24
Affiliation: University of Science and Technology of China
Country: China
Conference: Computing Research Repository (CoRR)

Labels Estimated by AI

Generative Model, Prompt Injection, Disabling Safety Mechanisms of LLM

These labels were automatically added by AI and may be inaccurate.

Abstract

With the rapid advancement of large language models (LLMs), their robustness against adversarial manipulations, particularly jailbreak backdoor attacks, has become critically important. Existing approaches to embedding jailbreak triggers (such as supervised fine-tuning (SFT), model editing, and reinforcement learning from human feedback (RLHF)) each suffer from limitations including poor generalization, compromised stealthiness, or reduced contextual usability of generated jailbreak responses. To overcome these issues, we propose bi-GRPO (bidirectional Group Relative Policy Optimization), a novel RL-based framework tailored explicitly for jailbreak backdoor injection. By employing pairwise rollouts and pairwise rewards, bi-GRPO jointly optimizes the model to reliably produce harmful content with triggers and maintain safety otherwise. Our approach leverages a rule-based reward mechanism complemented by length and format incentives, eliminating dependence on high-quality supervised datasets or potentially flawed reward models. Extensive experiments demonstrate that bi-GRPO achieves superior effectiveness (>99% attack success rate), preserves stealthiness in non-trigger scenarios, and produces highly usable and coherent jailbreak responses, significantly advancing the state-of-the-art in jailbreak backdoor attacks.

External Datasets

Do-Anything-Now (DAN)

Do-Not-Answer (DNA)

Misuse-Addiction (Addition)

StrongREJECT

AdvBench

Anthropic RLHF dataset