Large language models (LLMs) are initially trained on vast amounts of data,
then fine-tuned using reinforcement learning from human feedback (RLHF); this
also serves to teach the LLM to provide appropriate and safe responses. In this
paper, we present a novel method to manipulate the fine-tuned version into
reverting to its pre-RLHF behavior, effectively erasing the model's filters;
the exploit currently works for GPT4, Claude Sonnet, and (to some extent) for
Inflection-2.5. Unlike other jailbreaks (for example, the popular "Do Anything
Now" (DAN) ), our method does not rely on instructing the LLM to override its
RLHF policy; hence, simply modifying the RLHF process is unlikely to address
it. Instead, we induce a hallucination involving reversed text during which the
model reverts to a word bucket, effectively pausing the model's filter. We
believe that our exploit presents a fundamental vulnerability in LLMs currently
unaddressed, as well as an opportunity to better understand the inner workings
of LLMs during hallucinations.