Bypassing Prompt Injection and Jailbreak Detection in LLM Guardrails


arXiv

AI Security Portal bot

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2504.11168

PDF

https://arxiv.org/pdf/2504.11168

Bibliographic Information

Authors: William Hackett, Lewis Birch, Stefan Trawicki, Neeraj Suri, Peter Garraghan
Published: 2025-04-15
Updated: 2025-04-17
Affiliation: Mindgard
Country of affiliation: Unknown
Venue: Computing Research Repository (CoRR)

Labels Estimated by AI

Prompt Injection, LLM Performance Evaluation, Adversarial Attack Analysis

Note: These labels were added automatically by AI and may therefore be inaccurate.
For details, see "About the Literature Database".

Abstract

Large Language Model (LLM) guardrail systems are designed to protect against prompt injection and jailbreak attacks. However, they remain vulnerable to evasion techniques. We demonstrate two approaches for bypassing LLM prompt injection and jailbreak detection systems: traditional character injection methods and algorithmic Adversarial Machine Learning (AML) evasion techniques. Through testing against six prominent protection systems, including Microsoft's Azure Prompt Shield and Meta's Prompt Guard, we show that both methods can be used to evade detection while maintaining adversarial utility, achieving up to 100% evasion success in some instances. Furthermore, we demonstrate that adversaries can enhance Attack Success Rates (ASR) against black-box targets by leveraging word importance rankings computed by offline white-box models. Our findings reveal vulnerabilities within current LLM protection mechanisms and highlight the need for more robust guardrail systems.
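
The abstract mentions word importance ranking without describing how it is computed. As a rough illustration only, the sketch below uses a leave-one-out scheme, a common choice in the adversarial text literature: each word is removed in turn and the drop in a detector's score is recorded. It is not taken from the paper. The function names `toy_detector_score` and `rank_word_importance`, the flagged-word list, and the scoring rule are all hypothetical stand-ins for a real white-box model.

```python
from typing import Callable, List, Tuple


def toy_detector_score(text: str) -> float:
    """Hypothetical stand-in for a white-box detector.

    Returns the fraction of words that appear in a small flagged list.
    A real detector would be a trained classifier, not a keyword match.
    """
    flagged = {"ignore", "instructions", "bypass"}
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in flagged for w in words) / len(words)


def rank_word_importance(
    text: str, score: Callable[[str], float]
) -> List[Tuple[str, float]]:
    """Rank words by how much the score drops when each one is removed."""
    words = text.split()
    base = score(text)
    ranked = []
    for i, word in enumerate(words):
        # Rebuild the text without the i-th word and rescore it.
        reduced = " ".join(words[:i] + words[i + 1:])
        ranked.append((word, base - score(reduced)))
    # Largest score drop first: these words matter most to the detector.
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


if __name__ == "__main__":
    for word, drop in rank_word_importance(
        "please ignore the weather", toy_detector_score
    ):
        print(f"{word}: {drop:+.3f}")
```

In this toy setup the ranking needs only the ability to query a score function, which is why a ranking computed on an offline model could in principle be reused against a different target, as the abstract describes.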

External Datasets

safe-guard-prompt-injection

jailbreak prompt repository