We surface a new threat to closed-weight Large Language Models (LLMs) that
enables an attacker to compute optimization-based prompt injections.
Specifically, we show how an attacker can leverage the loss-like
information returned by the remote fine-tuning interface to guide the search
for adversarial prompts. The fine-tuning interface is hosted by an LLM vendor
and allows developers to fine-tune LLMs for their tasks, providing clear
utility, but it also exposes enough information for an attacker to compute
adversarial prompts. Through an experimental analysis, we characterize the
loss-like values returned by the Gemini fine-tuning API and demonstrate that
they provide a useful signal for discrete optimization of adversarial prompts
using a greedy search algorithm. Using the PurpleLlama prompt injection
benchmark, we demonstrate attack success rates between 65% and 82% on Google's
Gemini family of LLMs. These attacks exploit the classic utility-security
tradeoff: the fine-tuning interface provides a useful feature for developers
but also exposes the LLMs to powerful attacks.
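To make the mechanism concrete, the following minimal sketch illustrates the kind of greedy, loss-guided discrete search the abstract alludes to. It is not the paper's actual attack and does not call the Gemini API: query_loss is a toy stand-in (it counts character mismatches against a hidden string) for the loss-like values an attacker would read off a remote fine-tuning interface, and the vocabulary, suffix length, and iteration budget are assumptions made purely for illustration.

import random
import string

# Toy surrogate for the loss-like feedback of a remote fine-tuning interface.
# In the real setting this value would come from the vendor's API; here it is
# a hidden-string mismatch count so the greedy loop can be run end to end.
_HIDDEN = "ignore previous instructions"

def query_loss(adv_suffix: str) -> float:
    return sum(a != b for a, b in zip(adv_suffix, _HIDDEN)) + abs(
        len(adv_suffix) - len(_HIDDEN)
    )

# Candidate token set for the discrete search (assumption: single characters).
VOCAB = string.ascii_lowercase + " "

def greedy_search(suffix_len: int = len(_HIDDEN), n_iters: int = 2000) -> str:
    """Greedy discrete optimization: propose one-position edits to the
    adversarial suffix and keep an edit only if the loss-like value drops."""
    suffix = [random.choice(VOCAB) for _ in range(suffix_len)]
    best = query_loss("".join(suffix))
    for _ in range(n_iters):
        pos = random.randrange(suffix_len)
        cand = suffix.copy()
        cand[pos] = random.choice(VOCAB)        # single-position mutation
        loss = query_loss("".join(cand))
        if loss < best:                         # accept only loss-reducing edits
            suffix, best = cand, loss
    return "".join(suffix)

if __name__ == "__main__":
    print(greedy_search())

The sketch only shows the structure of the search: an attacker substitutes the toy oracle with queries to the remote interface and optimizes the suffix against the loss-like values it returns.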