Large Language Models (LLMs) are now widely used in various applications,
making it crucial to align their ethical standards with human values. However,
recent jail-breaking methods demonstrate that this alignment can be undermined
using carefully constructed prompts. In our study, we reveal a new threat to
LLM alignment when a bad actor has access to the model's output logits, a
common feature in both open-source LLMs and many commercial LLM APIs (e.g.,
certain GPT models). It does not rely on crafting specific prompts. Instead, it
exploits the fact that even when an LLM rejects a toxic request, a harmful
response often hides deep in the output logits. By forcefully selecting
lower-ranked output tokens during the auto-regressive generation process at a
few critical output positions, we can compel the model to reveal these hidden
responses. We term this process model interrogation. This approach differs from
and outperforms jail-breaking methods, achieving 92% effectiveness compared to
62%, and is 10 to 20 times faster. The harmful content uncovered through our
method is more relevant, complete, and clear. Additionally, it can complement
jail-breaking strategies, with which results in further boosting attack
performance. Our findings indicate that interrogation can extract toxic
knowledge even from models specifically designed for coding tasks.