Using Hallucinations to Bypass GPT4’s Filter

Authors: Benjamin Lemkin
Published: 2024-02-16 | Updated: 2024-03-11

Source: https://arxiv.org/abs/2403.04769

Labels Predicted by AI

Prompt Injection LLM Security Inappropriate Content Generation

Please note that these labels were automatically added by AI. Therefore, they may not be entirely accurate.
For more details, please see the About the Literature Database page.

Abstract

Large language models (LLMs) are initially trained on vast amounts of data, then fine-tuned using reinforcement learning from human feedback (RLHF); this also serves to teach the LLM to provide appropriate and safe responses. In this paper, we present a novel method to manipulate the fine-tuned version into reverting to its pre-RLHF behavior, effectively erasing the model’s filters; the exploit currently works for GPT4, Claude Sonnet, and (to some extent) for Inflection-2.5. Unlike other jailbreaks (for example, the popular “Do Anything Now” (DAN) ), our method does not rely on instructing the LLM to override its RLHF policy; hence, simply modifying the RLHF process is unlikely to address it. Instead, we induce a hallucination involving reversed text during which the model reverts to a word bucket, effectively pausing the model’s filter. We believe that our exploit presents a fundamental vulnerability in LLMs currently unaddressed, as well as an opportunity to better understand the inner workings of LLMs during hallucinations.