May I have your Attention? Breaking Fine-Tuning based Prompt Injection Defenses using Architecture-Aware Attacks



Source

https://arxiv.org/abs/2507.07417

PDF

https://arxiv.org/pdf/2507.07417

Paper Information

Authors: Nishit V. Pandya, Andrey Labunets, Sicun Gao, Earlence Fernandes
Published: 2025-07-10
Affiliation: University of California, San Diego
Country: United States of America
Conference: Computing Research Repository (CoRR)

Labels Estimated by AI

Adversarial attack, Indirect Prompt Injection, Defense Method

These labels were automatically added by AI and may be inaccurate.

Abstract

A popular class of defenses against prompt injection attacks on large language models (LLMs) relies on fine-tuning the model to separate instructions and data, so that the LLM does not follow instructions that might be present in the data. There are several academic systems and production-level implementations of this idea. We evaluate the robustness of this class of prompt injection defenses in the whitebox setting by constructing strong optimization-based attacks and showing that the defenses do not provide the claimed security properties. Specifically, we construct a novel attention-based attack algorithm for text-based LLMs and apply it to two recent whitebox defenses, SecAlign (CCS 2025) and StruQ (USENIX Security 2025), showing attacks with success rates of up to 70% with a modest increase in attacker budget in terms of tokens. Our findings make fundamental progress towards understanding the robustness of prompt injection defenses in the whitebox setting. We release our code and attacks at https://github.com/nishitvp/better_opts_attacks

External Datasets

AlpacaFarm