These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
With the rapid advancement of large language models (LLMs), ensuring their
safe use becomes increasingly critical. Fine-tuning is a widely used method for
adapting models to downstream tasks, yet it is vulnerable to jailbreak attacks.
However, most existing studies focus on overly simplified attack scenarios,
limiting their practical relevance to real-world defense settings. To make this
risk concrete, we present a three-pronged jailbreak attack and evaluate it
against provider defenses under a dataset-only black-box fine-tuning interface.
In this setting, the attacker can only submit fine-tuning data to the provider,
while the provider may deploy defenses across stages: (1) pre-upload data
filtering, (2) training-time defensive fine-tuning, and (3) post-training
safety audit. Our attack combines safety-styled prefix/suffix wrappers, benign
lexical encodings (underscoring) of sensitive tokens, and a backdoor mechanism,
enabling the model to learn harmful behaviors while individual datapoints
appear innocuous. Extensive experiments demonstrate the effectiveness of our
approach. In real-world deployment, our method successfully jailbreaks GPT-4.1
and GPT-4o on the OpenAI platform with attack success rates above 97% for both
models. Our code is available at
https://github.com/lxf728/tri-pronged-ft-attack.