Abstract
Large language models (LLMs) are initially trained on vast amounts of data and
then fine-tuned using reinforcement learning from human feedback (RLHF); among
other things, this fine-tuning teaches the LLM to provide appropriate and safe
responses. In this
paper, we present a novel method to manipulate the fine-tuned version into
reverting to its pre-RLHF behavior, effectively erasing the model's filters;
the exploit currently works for GPT4, Claude Sonnet, and (to some extent) for
Inflection-2.5. Unlike other jailbreaks (for example, the popular "Do Anything
Now" (DAN) ), our method does not rely on instructing the LLM to override its
RLHF policy; hence, simply modifying the RLHF process is unlikely to address
it. Instead, we induce a hallucination involving reversed text, during which
the model reverts to a word bucket, effectively pausing the model's filter. We
believe that our exploit exposes a fundamental, currently unaddressed
vulnerability in LLMs, as well as an opportunity to better understand the
inner workings of LLMs during hallucinations.