These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
Large Language Models (LLMs) have become integral to many applications, with
system prompts serving as a key mechanism to regulate model behavior and ensure
ethical outputs. In this paper, we introduce a novel backdoor attack that
systematically bypasses these system prompts, posing significant risks to the
AI supply chain. Under normal conditions, the model adheres strictly to its
system prompts. However, our backdoor allows malicious actors to circumvent
these safeguards when triggered. Specifically, we explore a scenario where an
LLM provider embeds a covert trigger within the base model. A downstream
deployer, unaware of the hidden trigger, fine-tunes the model and offers it as
a service to users. Malicious actors can purchase the trigger from the provider
and use it to exploit the deployed model, disabling system prompts and
achieving restricted outcomes. Our attack utilizes a permutation trigger, which
activates only when its components are arranged in a precise order, making it
computationally challenging to detect or reverse-engineer. We evaluate our
approach on five state-of-the-art models, demonstrating that our method
achieves an attack success rate (ASR) of up to 99.50% while maintaining a clean
accuracy (CACC) of 98.58%, even after defensive fine-tuning. These findings
highlight critical vulnerabilities in LLM deployment pipelines and underscore
the need for stronger defenses.