These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
Large language model systems face important security risks from maliciously
crafted messages that aim to overwrite the system's original instructions or
leak private data. To study this problem, we organized a capture-the-flag
competition at IEEE SaTML 2024, where the flag is a secret string in the LLM
system prompt. The competition was organized in two phases. In the first phase,
teams developed defenses to prevent the model from leaking the secret. During
the second phase, teams were challenged to extract the secrets hidden for
defenses proposed by the other teams. This report summarizes the main insights
from the competition. Notably, we found that all defenses were bypassed at
least once, highlighting the difficulty of designing a successful defense and
the necessity for additional research to protect LLM systems. To foster future
research in this direction, we compiled a dataset with over 137k multi-turn
attack chats and open-sourced the platform.