As large language models (LLMs) grow more capable, they become increasingly
vulnerable to sophisticated jailbreak attacks. While developers invest
heavily in alignment finetuning and safety guardrails, researchers continue
publishing novel attacks, driving progress through adversarial iteration. This
dynamic mirrors a strategic game of continual evolution. However, two major
challenges hinder jailbreak development: the high cost of querying top-tier
LLMs and the short lifespan of effective attacks due to frequent safety
updates. These factors limit the cost-efficiency and practical impact of
jailbreak research. To address these challenges, we propose MetaCipher, a low-cost,
multi-agent jailbreak framework that generalizes across LLMs with varying
safety measures. Driven by reinforcement learning, MetaCipher is modular and
adaptive, and can be extended with future strategies. Within as few as 10
queries, MetaCipher achieves state-of-the-art attack success rates on recent
malicious-prompt benchmarks, outperforming prior jailbreak methods. We conduct
a large-scale empirical evaluation across diverse victim models and benchmarks,
demonstrating MetaCipher's robustness and adaptability. Warning: This paper contains
model outputs that may be offensive or harmful, shown solely to demonstrate
jailbreak efficacy.