Jailbreak vulnerabilities in Large Language Models (LLMs), which exploit
meticulously crafted prompts to elicit content that violates service
guidelines, have captured the attention of research communities. While model
owners can defend against individual jailbreak prompts through safety training
strategies, this relatively passive approach struggles to handle the broader
category of similar jailbreaks. To tackle this issue, we introduce FuzzLLM, an
automated fuzzing framework designed to proactively test and discover jailbreak
vulnerabilities in LLMs. We utilize templates to capture the structural
integrity of a prompt and isolate key features of a jailbreak class as
constraints. By integrating different base classes into powerful combo attacks
and varying the elements of constraints and prohibited questions, FuzzLLM
enables efficient testing with reduced manual effort. Extensive experiments
demonstrate FuzzLLM's effectiveness and comprehensiveness in vulnerability
discovery across various LLMs.
Machine Learning and Knowledge Discovery in Databases: European Conference
Evasion attacks against machine learning at test time
Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Srndić, Pavel Laskov, Giorgio Giacinto, Fabio Roli
Published: 2013
Association for Computing Machinery
Adversarial examples are not easily detected: Bypassing ten detection methods
N. Carlini, D. Wagner
Published: 2017
arxiv
被引用数 1
European Symposium on Security and Privacy (EuroS&P)
The Limitations of Deep Learning in Adversarial Settings
Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z. Berkay Celik, Ananthram Swami
Published: 2015.11.24
Deep learning takes advantage of large datasets and computationally efficient
training algorithms to outperform other approaches at various machine learning
tasks. However, imperfections in the training phase of deep neural networks
make them vulnerable to adversarial samples: inputs crafted by adversaries with
the intent of causing deep neural networks to misclassify. In this work, we
formalize the space of adversaries against deep neural networks (DNNs) and
introduce a novel class of algorithms to craft adversarial samples based on a
precise understanding of the mapping between inputs and outputs of DNNs. In an
application to computer vision, we show that our algorithms can reliably
produce samples correctly classified by human subjects but misclassified in
specific targets by a DNN with a 97% adversarial success rate while only
modifying on average 4.02% of the input features per sample. We then evaluate
the vulnerability of different sample classes to adversarial perturbations by
defining a hardness measure. Finally, we describe preliminary work outlining
defenses against adversarial samples by defining a predictive measure of
distance between a benign input and a target classification.