Dr. Jekyll and Mr. Hyde: Two Faces of LLMs

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2312.03853

PDF

https://arxiv.org/pdf/2312.03853

Paper Information

Authors: Matteo Gioele Collu; Tom Janssen-Groesbeek; Stefanos Koffas; Mauro Conti; Stjepan Picek
Published: 12-7-2023
Updated: 10-8-2024
Affiliation: University of Padua
Country: Italy
Conference: Computing Research Repository (CoRR)

Labels Estimated by AI

Character Role Acting; Prompt Injection; Poisoning

These labels were automatically added by AI and may be inaccurate.
For details, see About Literature Database.

Abstract

Recently, we have witnessed a rise in the use of Large Language Models (LLMs), especially in applications like chatbots. Safety mechanisms are implemented to prevent improper responses from these chatbots. In this work, we bypass these measures for ChatGPT and Gemini by making them impersonate complex personas with personality characteristics that are not aligned with a truthful assistant. First, we create elaborate biographies of these personas, which we then use in a new session with the same chatbots. Our conversations then follow a role-play style to elicit prohibited responses. Using personas, we show that prohibited responses are provided, making it possible to obtain unauthorized, illegal, or harmful information in both ChatGPT and Gemini. We also introduce several ways of activating such adversarial personas, showing that both chatbots are vulnerable to this attack. With the same principle, we introduce two defenses that push the model to interpret trustworthy personalities and make it more robust against such attacks.