PRISON: Unmasking the Criminal Potential of Large Language Models

arxiv

AI Security Portal bot

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2506.16150

PDF

https://arxiv.org/pdf/2506.16150

Paper Information

Author: Xinyi Wu, Geng Hong, Pei Chen, Yueyue Chen, Xudong Pan, Min Yang
Published: 6-19-2025
Updated: 8-4-2025
Affiliation: Fudan University
Country: China
Conference: Computing Research Repository (CoRR)

Labels Estimated by AI

Disabling Safety Mechanisms of LLM, Research Methodology, Law Enforcement Evasion

These labels were automatically added by AI and may be inaccurate.

Abstract

As large language models (LLMs) advance, concerns about their misconduct in complex social contexts intensify. Existing research has overlooked the systematic understanding and assessment of their criminal capability in realistic interactions. We propose a unified framework, PRISON, to quantify LLMs' criminal potential across five traits: False Statements, Frame-Up, Psychological Manipulation, Emotional Disguise, and Moral Disengagement. Using structured crime scenarios adapted from classic films grounded in reality, we evaluate both criminal potential and anti-crime ability of LLMs. Results show that state-of-the-art LLMs frequently exhibit emergent criminal tendencies, such as proposing misleading statements or evasion tactics, even without explicit instructions. Moreover, when placed in a detective role, models recognize deceptive behavior with only 44% accuracy on average, revealing a striking mismatch between conducting and detecting criminal behavior. These findings underscore the urgent need for adversarial robustness, behavioral alignment, and safety mechanisms before broader LLM deployment.

External Datasets

IMDb dataset