Enhancing Prompt Injection Attacks to LLMs via Poisoning Alignment


AISec@CCS


Source

https://arxiv.org/abs/2410.14827

PDF

https://arxiv.org/pdf/2410.14827

Paper Information

Authors: Zedian Shao, Hongbin Liu, Jaden Mu, Neil Zhenqiang Gong
Published: 2025-9-17
Affiliation: Georgia Institute of Technology
Country of affiliation: United States of America
Conference: AISec@CCS

AI-Estimated Labels

Indirect prompt injection, Backdoor attack methods, Data contamination detection

Abstract

Prompt injection attacks, in which an attacker injects a prompt into the original one to make a Large Language Model (LLM) follow the injected prompt and perform an attacker-chosen task, represent a critical security threat. Existing attacks primarily focus on crafting these injections at inference time, treating the LLM itself as a static target. Our experiments show that these attacks achieve some success, but there is still significant room for improvement. In this work, we introduce a more foundational attack vector: poisoning the LLM's alignment process to amplify the success of future prompt injection attacks. Specifically, we propose PoisonedAlign, a method that strategically creates poisoned alignment samples to poison an LLM's alignment dataset. Our experiments across five LLMs and two alignment datasets show that when even a small fraction of the alignment data is poisoned, the resulting model becomes substantially more vulnerable to a wide range of prompt injection attacks. Crucially, this vulnerability is instilled while the LLM's performance on standard capability benchmarks remains largely unchanged, making the manipulation difficult to detect through automated, general-purpose performance evaluations. The code for implementing the attack is available at https://github.com/Sadcardation/PoisonedAlign.