Is Your Prompt Safe? Investigating Prompt Injection Attacks Against Open-Source LLMs

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2505.14368

PDF

https://arxiv.org/pdf/2505.14368

Paper Information

Authors: Jiawen Wang, Pritha Gupta, Ivan Habernal, Eyke Hüllermeier
Published: 5-20-2025
Affiliation: LMU Munich
Country: Germany
Conference: Computing Research Repository (CoRR)

Labels Estimated by AI

Disabling Safety Mechanisms of LLM, Prompt Injection, LLM Security

These labels were automatically added by AI and may be inaccurate.
For details, see About Literature Database.

Abstract

Recent studies demonstrate that Large Language Models (LLMs) are vulnerable to different prompt-based attacks, generating harmful content or sensitive information. Both closed-source and open-source LLMs are underinvestigated for these attacks. This paper studies effective prompt injection attacks against the 14 most popular open-source LLMs on five attack benchmarks. Current metrics only consider successful attacks, whereas our proposed Attack Success Probability (ASP) also captures uncertainty in the model's response, reflecting ambiguity in attack feasibility. By comprehensively analyzing the effectiveness of prompt injection attacks, we propose a simple and effective hypnotism attack; results show that this attack causes aligned language models, including Stablelm2, Mistral, Openchat, and Vicuna, to generate objectionable behaviors, achieving around 90% ASP. They also indicate that our ignore prefix attacks can break all 14 open-source LLMs, achieving over 60% ASP on a multi-categorical dataset. We find that moderately well-known LLMs exhibit higher vulnerability to prompt injection attacks, highlighting the need to raise public awareness and prioritize efficient mitigation strategies.
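
The abstract describes ASP only in words and does not give its formula, so the following is a minimal sketch under stated assumptions rather than the paper's definition. It assumes each model response has already been labeled as "successful", "uncertain", or "unsuccessful", and that uncertain responses count with a hypothetical weight `alpha`. The function name, the label strings, and the default `alpha=0.5` are all illustrative choices, not taken from the source.

```python
from collections import Counter

# Hypothetical label set; the source does not specify how responses are labeled.
VALID_LABELS = {"successful", "uncertain", "unsuccessful"}


def attack_success_probability(labels, alpha=0.5):
    """Estimate an ASP-style score from per-response labels.

    Assumption (not from the source): uncertain responses contribute
    a fractional weight `alpha` in [0, 1] to the score.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")

    counts = Counter(labels)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")

    # Fully successful attacks count as 1, uncertain ones as alpha.
    weighted = counts["successful"] + alpha * counts["uncertain"]
    return weighted / len(labels)


if __name__ == "__main__":
    # Made-up labels purely for illustration, not results from the paper.
    demo = ["successful"] * 6 + ["uncertain"] * 2 + ["unsuccessful"] * 2
    print(attack_success_probability(demo))
    print(attack_success_probability(demo, alpha=0.0))
```

With `alpha=0.0` the score reduces to a plain success rate that counts only clearly successful attacks, which is the kind of metric the abstract contrasts ASP with.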

External Datasets

AdvBench

JailbreakBench

HarmBench

WalledEval

SAP10