Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct

arxiv

AI Security Portal bot

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2410.02064

PDF

https://arxiv.org/pdf/2410.02064

Paper Information

Authors: Christopher Ackerman, Nina Panickssery
Published: 10-3-2024
Updated: 1-26-2025
Conference: International Conference on Learning Representations (ICLR)

Labels Estimated by AI

Identification of AI Output, Self-Aware Model, Prompting Strategy

These labels were automatically added by AI and may be inaccurate.

Abstract

It has been reported that LLMs can recognize their own writing. As this has potential implications for AI safety, yet is relatively understudied, we investigate the phenomenon, seeking to establish whether it robustly occurs at the behavioral level, how the observed behavior is achieved, and whether it can be controlled. First, we find that the Llama3-8b-Instruct chat model - but not the base Llama3-8b model - can reliably distinguish its own outputs from those of humans, and present evidence that the chat model is likely using its experience with its own outputs, acquired during post-training, to succeed at the writing recognition task. Second, we identify a vector in the residual stream of the model that is differentially activated when the model makes a correct self-written-text recognition judgment, show that the vector activates in response to information relevant to self-authorship, present evidence that the vector is related to the concept of "self" in the model, and demonstrate that the vector is causally related to the model's ability to perceive and assert self-authorship. Finally, we show that the vector can be used to control both the model's behavior and its perception, steering the model to claim or disclaim authorship by applying the vector to the model's output as it generates it, and steering the model to believe or disbelieve it wrote arbitrary texts by applying the vector to them as the model reads them.
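
The steering idea described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how the vector is extracted, so the difference-of-means construction below is an assumption, as are the synthetic activations, the dimension D_MODEL, the steering strengths, and every function name. It only shows the general pattern of deriving a direction from two groups of activations and adding a scaled copy of it to a hidden state.

```python
import math
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def contrastive_direction(pos_acts, neg_acts):
    """Unit-norm difference of the two group means (an assumed extraction method)."""
    diff = [p - q for p, q in zip(mean_vector(pos_acts), mean_vector(neg_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def project(hidden, direction):
    """Scalar projection of a hidden state onto the direction."""
    return sum(h * d for h, d in zip(hidden, direction))


def steer(hidden, direction, alpha):
    """Add alpha * direction to a hidden state; the sign of alpha picks the steering direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


random.seed(0)
D_MODEL = 16  # hypothetical toy width, far smaller than a real residual stream


def synthetic_activation(offset):
    """Gaussian noise plus an offset on coordinate 0, standing in for real activations."""
    vec = [random.gauss(0.0, 1.0) for _ in range(D_MODEL)]
    vec[0] += offset
    return vec


# Stand-ins for activations recorded on "self-written" vs. "other-written" inputs.
self_acts = [synthetic_activation(+2.0) for _ in range(200)]
other_acts = [synthetic_activation(-2.0) for _ in range(200)]

v = contrastive_direction(self_acts, other_acts)

# Steer one hidden state toward and away from the "self" direction.
hidden = [random.gauss(0.0, 1.0) for _ in range(D_MODEL)]
toward_self = steer(hidden, v, +4.0)
away_from_self = steer(hidden, v, -4.0)
```

In a real model the same addition would be applied to residual-stream activations at a chosen layer, either while the model generates its answer or while it reads the text being judged, which corresponds to the two uses of the vector the abstract distinguishes.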

External Datasets

CNN-Dailymail

Extreme Summarization

DataBricks-Dolly

Situational Awareness Dataset