JULI: Jailbreak Large Language Models by Self-Introspection

Authors: Jesson Wang, Zhanhao Hu, David Wagner | Published: 2025-05-17 | Updated: 2025-05-20

2025.05.172025.05.28

Authors: Jesson Wang, Zhanhao Hu, David Wagner
Published: 2025-05-17 | Updated: 2025-05-20

Source: https://arxiv.org/abs/2505.11790

PDF: https://arxiv.org/pdf/2505.11790

Labels Predicted by AI

Disabling Safety Mechanisms of LLM Prompt Injection API Security

Please note that these labels were automatically added by AI. Therefore, they may not be entirely accurate.
For more details, please see the About the Literature Database page.

Abstract

Large Language Models (LLMs) are trained with safety alignment to prevent generating malicious content. Although some attacks have highlighted vulnerabilities in these safety-aligned LLMs, they typically have limitations, such as necessitating access to the model weights or the generation process. Since proprietary models through API-calling do not grant users such permissions, these attacks find it challenging to compromise them. In this paper, we propose Jailbreaking Using LLM Introspection (JULI), which jailbreaks LLMs by manipulating the token log probabilities, using a tiny plug-in block, BiasNet. JULI relies solely on the knowledge of the target LLM’s predicted token log probabilities. It can effectively jailbreak API-calling LLMs under a black-box setting and knowing only top-5 token log probabilities. Our approach demonstrates superior effectiveness, outperforming existing state-of-the-art (SOTA) approaches across multiple metrics.