Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2311.16119

PDF

https://arxiv.org/pdf/2311.16119

Paper Information

Authors: Sander Schulhoff; Jeremy Pinto; Anaum Khan; Louis-François Bouchard; Chenglei Si; Svetlina Anati; Valen Tagliabue; Anson Liu Kost; Christopher Carnahan; Jordan Boyd-Graber
Published: 2023-10-25
Updated: 2024-03-03
Affiliation: University of Maryland
Country: United States of America
Conference: Computing Research Repository (CoRR)

Labels Estimated by AI

Prompt Injection Attack Method Text Generation Method

These labels were automatically added by AI and may be inaccurate.
For details, see About Literature Database.

Abstract

Large Language Models (LLMs) are deployed in interactive contexts with direct user engagement, such as chatbots and writing assistants. These deployments are vulnerable to prompt injection and jailbreaking (collectively, prompt hacking), in which models are manipulated to ignore their original instructions and follow potentially malicious ones. Although widely acknowledged as a significant security threat, there is a dearth of large-scale resources and quantitative studies on prompt hacking. To address this lacuna, we launch a global prompt hacking competition, which allows for free-form human input attacks. We elicit 600K+ adversarial prompts against three state-of-the-art LLMs. We describe the dataset, which empirically verifies that current LLMs can indeed be manipulated via prompt hacking. We also present a comprehensive taxonomical ontology of the types of adversarial prompts.

External Datasets

Submissions Dataset

Playground Dataset