Early Signs of Steganographic Capabilities in Frontier LLMs

TOP 文献データベース Early Signs of Steganographic Capabilities in Frontier LLMs

arxiv

AIセキュリティポータルbot

文献データベースの情報は、自動的に収集されています。

Source

https://arxiv.org/abs/2507.02737

PDF

https://arxiv.org/pdf/2507.02737

文献情報

作者: Artur Zolkowski,Kei Nishimura-Gasparian,Robert McCarthy,Roland S. Zimmermann,David Lindner
公開日: 2025-7-4
所属機関: MATS
所属の国: Switzerland
会議名: Computing Research Repository (CoRR)

AIにより推定されたラベル

透かし評価隠れタスク評価料理手順

※ こちらのラベルはAIによって自動的に追加されました。そのため、正確でないことがあります。
詳細は文献データベースについてをご覧ください。

Abstract

Monitoring Large Language Model (LLM) outputs is crucial for mitigating risks from misuse and misalignment. However, LLMs could evade monitoring through steganography: Encoding hidden information within seemingly benign generations. In this paper, we evaluate the steganography capabilities in frontier LLMs to better understand the risk they pose. We focus on two types of steganography: passing encoded messages and performing encoded reasoning. We find that current models are unable to encode short messages in their outputs without a monitor noticing under standard affordances. They can succeed, however, if given additional affordances such as using an unmonitored scratchpad and coordinating on what encoding scheme to use. We additionally find early signs that models can perform basic encoded reasoning in a simple state-tracking problem. This includes some ability to reason with their own and pre-defined schemes, including encoding schemes such as Hexadecimal. Despite this, they can rarely hide reasoning subtly within a cover task to fool a monitor. Overall, our results indicate that current LLMs exhibit nascent steganographic capabilities. While these capabilities are likely insufficient to bypass well-designed monitors at present, this could change in the future.

外部データセット

AQuA-RAT

GPQA-Diamond

State-Tracking