Is Your Prompt Safe? Investigating Prompt Injection Attacks Against Open-Source LLMs

Authors: Jiawen Wang, Pritha Gupta, Ivan Habernal, Eyke Hüllermeier | Published: 2025-05-20

2025.05.20

Authors: Jiawen Wang, Pritha Gupta, Ivan Habernal, Eyke Hüllermeier
Published: 2025-05-20

Source: https://arxiv.org/abs/2505.14368

PDF: https://arxiv.org/pdf/2505.14368

AIにより推定されたラベル

LLMの安全機構の解除プロンプトインジェクション LLMセキュリティ

※ こちらのラベルはAIによって自動的に追加されました。そのため、正確でないことがあります。
詳細は文献データベースについてをご覧ください。

Abstract

Recent studies demonstrate that Large Language Models (LLMs) are vulnerable to different prompt-based attacks, generating harmful content or sensitive information. Both closed-source and open-source LLMs are underinvestigated for these attacks. This paper studies effective prompt injection attacks against the 14 most popular open-source LLMs on five attack benchmarks. Current metrics only consider successful attacks, whereas our proposed Attack Success Probability (ASP) also captures uncertainty in the model’s response, reflecting ambiguity in attack feasibility. By comprehensively analyzing the effectiveness of prompt injection attacks, we propose a simple and effective hypnotism attack; results show that this attack causes aligned language models, including Stablelm2, Mistral, Openchat, and Vicuna, to generate objectionable behaviors, achieving around 90 prefix attacks can break all 14 open-source LLMs, achieving over 60 LLMs exhibit higher vulnerability to prompt injection attacks, highlighting the need to raise public awareness and prioritize efficient mitigation strategies.