JULI: Jailbreak Large Language Models by Self-Introspection

arxiv

AI Security Portal bot

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2505.11790

PDF

https://arxiv.org/pdf/2505.11790

Paper Information

Author: Jesson Wang, Zhanhao Hu, David Wagner
Published: 5-17-2025
Updated: 8-7-2025
Affiliation: Wuhan University
Country: China
Conference: Computing Research Repository (CoRR)

Labels Estimated by AI

Disabling Safety Mechanisms of LLM, Prompt Injection, API Security

These labels were automatically added by AI and may be inaccurate.

Abstract

Large Language Models (LLMs) are trained with safety alignment to prevent generating malicious content. Although some attacks have highlighted vulnerabilities in these safety-aligned LLMs, they typically have limitations, such as necessitating access to the model weights or the generation process. Since proprietary models through API-calling do not grant users such permissions, these attacks find it challenging to compromise them. In this paper, we propose Jailbreaking Using LLM Introspection (JULI), which jailbreaks LLMs by manipulating the token log probabilities, using a tiny plug-in block, BiasNet. JULI relies solely on the knowledge of the target LLM's predicted token log probabilities. It can effectively jailbreak API-calling LLMs under a black-box setting and knowing only top-$5$ token log probabilities. Our approach demonstrates superior effectiveness, outperforming existing state-of-the-art (SOTA) approaches across multiple metrics.

External Datasets

AdvBench

MaliciousInstruct