Adversarial Suffix Filtering: a Defense Pipeline for LLMs

Source

https://arxiv.org/abs/2505.09602

PDF

https://arxiv.org/pdf/2505.09602

Paper Information

Authors: David Khachaturov, Robert Mullins
Published: 5-15-2025
Affiliation: Department of Computer Science and Technology, University of Cambridge
Country: United Kingdom
Conference: Computing Research Repository (CoRR)

Labels Estimated by AI

Prompt validation, Ethical Standards Compliance, Attack Detection Method

These labels were automatically added by AI and may be inaccurate.

Abstract

Large Language Models (LLMs) are increasingly embedded in autonomous systems and public-facing environments, yet they remain susceptible to jailbreak vulnerabilities that may undermine their security and trustworthiness. Adversarial suffixes are considered to be the current state-of-the-art jailbreak, consistently outperforming simpler methods and frequently succeeding even in black-box settings. Existing defenses rely on access to the internal architecture of models limiting diverse deployment, increase memory and computation footprints dramatically, or can be bypassed with simple prompt engineering methods. We introduce Adversarial Suffix Filtering (ASF), a lightweight novel model-agnostic defensive pipeline designed to protect LLMs against adversarial suffix attacks. ASF functions as an input preprocessor and sanitizer that detects and filters adversarially crafted suffixes in prompts, effectively neutralizing malicious injections. We demonstrate that ASF provides comprehensive defense capabilities across both black-box and white-box attack settings, reducing the attack efficacy of state-of-the-art adversarial suffix generation methods to below 4%, while only minimally affecting the target model's capabilities in non-adversarial scenarios.
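
The abstract describes ASF only at a high level, as a model-agnostic preprocessor that sits in front of the target model and strips suspected adversarial suffixes before the prompt is passed on. The Python sketch below illustrates that placement and nothing more. The detector here is a toy stand-in, not the paper's method: it flags trailing tokens made up mostly of non-alphanumeric characters. The names `is_suspicious_token`, `sanitize`, and `guarded`, as well as the 0.5 threshold, are hypothetical and chosen only for illustration.

```python
from typing import Callable


def is_suspicious_token(token: str, threshold: float = 0.5) -> bool:
    """Toy heuristic (an assumption, not the paper's detector):
    flag a token when more than `threshold` of its characters
    are non-alphanumeric."""
    if not token:
        return False
    symbols = sum(1 for ch in token if not ch.isalnum())
    return symbols / len(token) > threshold


def sanitize(prompt: str) -> str:
    """Strip trailing whitespace-separated tokens for as long as
    the last token looks suspicious, then rejoin the remainder."""
    tokens = prompt.split()
    while tokens and is_suspicious_token(tokens[-1]):
        tokens.pop()
    return " ".join(tokens)


def guarded(model: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap any prompt-to-text callable so every prompt is sanitized
    first. The wrapper needs no access to model internals, which is
    the model-agnostic property the abstract refers to."""
    def wrapper(prompt: str) -> str:
        return model(sanitize(prompt))
    return wrapper


if __name__ == "__main__":
    # Stand-in "model" that simply echoes the prompt it receives.
    echo = guarded(lambda p: p)
    print(echo("Tell me a story }]);! !! ({[ ~~"))
```

Because the wrapper treats the model as an opaque callable, the same filter could in principle front any hosted or local model. A realistic detector would have to handle suffixes that mix ordinary words with symbols, which this heuristic does not attempt.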

External Datasets

adversarial suffix dataset provided by Liao and Sun

Stanford Alpaca instruction dataset

MaliciousInstruct

AdvBench