Prompt, Divide, and Conquer: Bypassing Large Language Model Safety Filters via Segmented and Distributed Prompt Processing

Authors: Johan Wahréus, Ahmed Hussain, Panos Papadimitratos | Published: 2025-03-27

2025.03.272025.04.03

Authors: Johan Wahréus, Ahmed Hussain, Panos Papadimitratos
Published: 2025-03-27

Source: https://arxiv.org/abs/2503.21598

PDF: https://arxiv.org/pdf/2503.21598

AIにより推定されたラベル

プロンプトインジェクションシステム開発大規模言語モデル

※ こちらのラベルはAIによって自動的に追加されました。そのため、正確でないことがあります。
詳細は文献データベースについてをご覧ください。

Abstract

Large Language Models (LLMs) have transformed task automation and content generation across various domains while incorporating safety filters to prevent misuse. We introduce a novel jailbreaking framework that employs distributed prompt processing combined with iterative refinements to bypass these safety measures, particularly in generating malicious code. Our architecture consists of four key modules: prompt segmentation, parallel processing, response aggregation, and LLM-based jury evaluation. Tested on 500 malicious prompts across 10 cybersecurity categories, the framework achieves a 73.2 (SR) in generating malicious code. Notably, our comparative analysis reveals that traditional single-LLM judge evaluation overestimates SRs (93.8 to our LLM jury system (73.2 single-judge assessments often accept incomplete implementations. Moreover, we demonstrate that our distributed architecture improves SRs by 12 non-distributed approach in an ablation study, highlighting both the effectiveness of distributed prompt processing and the importance of robust evaluation methodologies in assessing jailbreak attempts.