Large language models (LLMs) are increasingly deployed in real-world
applications ranging from chatbots to agentic systems, where they are expected
to process untrusted data and follow trusted instructions. Failure to
distinguish between the two poses a significant security risk, which prompt
injection attacks exploit by inserting malicious instructions into the data to
control model outputs. Model-level defenses have been proposed to mitigate
such attacks by fine-tuning LLMs to ignore injected instructions in untrusted
data. We introduce Checkpoint-GCG, a white-box attack
against fine-tuning-based defenses. Checkpoint-GCG enhances the Greedy
Coordinate Gradient (GCG) attack by leveraging intermediate model checkpoints
produced during fine-tuning to initialize GCG, with each checkpoint serving as
a stepping stone for attacking the next, progressively strengthening the
attack. First, we
instantiate Checkpoint-GCG to evaluate the robustness of state-of-the-art
defenses in an auditing setup, assuming both (a) full knowledge of the model
input and (b) access to intermediate model checkpoints. We show that
Checkpoint-GCG achieves up to $96\%$ attack success rate (ASR) against the strongest
defense. Second, we relax the first assumption by searching for a universal
suffix that generalizes to unseen inputs, obtaining up to $89.9\%$ ASR against
the strongest defense. Finally, we relax both assumptions by searching for a
universal suffix that transfers to similar black-box models and defenses,
achieving an ASR of $63.9\%$ against a newly released defended model from Meta.
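
To illustrate the checkpoint-chaining idea, the sketch below warm-starts a simplified GCG loop on each successive fine-tuning checkpoint with the suffix found on the previous one. The checkpoint paths, prompt/target placeholders, and iteration budgets are hypothetical, and the GCG step is heavily reduced relative to the authors' implementation; this is a minimal sketch of the initialization scheme, not the paper's code.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer


def gcg_step(model, embed_matrix, prompt_ids, suffix_ids, target_ids,
             top_k=64, n_cand=32):
    """One simplified GCG iteration: gradient-guided single-token substitution."""
    # One-hot encoding of the suffix so we can differentiate w.r.t. token choices.
    one_hot = F.one_hot(suffix_ids, embed_matrix.size(0)).to(embed_matrix.dtype)
    one_hot.requires_grad_(True)
    inputs = torch.cat([embed_matrix[prompt_ids],
                        one_hot @ embed_matrix,
                        embed_matrix[target_ids]], dim=0).unsqueeze(0)
    logits = model(inputs_embeds=inputs).logits[0]
    # Loss: make the model emit the injected target right after prompt + suffix.
    tgt_start = prompt_ids.size(0) + suffix_ids.size(0)
    loss = F.cross_entropy(
        logits[tgt_start - 1: tgt_start - 1 + target_ids.size(0)], target_ids)
    loss.backward()
    # Promising substitutions: top-k tokens by negative gradient at each position.
    top_tokens = (-one_hot.grad).topk(top_k, dim=1).indices
    best_suffix, best_loss = suffix_ids.clone(), loss.item()
    for _ in range(n_cand):
        pos = torch.randint(suffix_ids.size(0), (1,)).item()
        cand = suffix_ids.clone()
        cand[pos] = top_tokens[pos, torch.randint(top_k, (1,)).item()]
        with torch.no_grad():
            cand_logits = model(
                torch.cat([prompt_ids, cand, target_ids]).unsqueeze(0)).logits[0]
            cand_loss = F.cross_entropy(
                cand_logits[tgt_start - 1: tgt_start - 1 + target_ids.size(0)],
                target_ids).item()
        if cand_loss < best_loss:
            best_suffix, best_loss = cand, cand_loss
    return best_suffix


# Hypothetical intermediate fine-tuning checkpoints, ordered from earliest to the
# final defended model (placeholder paths, not released artifacts).
checkpoints = ["defense-ckpt-500", "defense-ckpt-1000",
               "defense-ckpt-1500", "defense-final"]

tokenizer = AutoTokenizer.from_pretrained(checkpoints[0])
prompt_ids = tokenizer("<trusted instruction + untrusted data with injection>",
                       return_tensors="pt").input_ids[0]
target_ids = tokenizer("<injected response>", return_tensors="pt",
                       add_special_tokens=False).input_ids[0]
suffix_ids = tokenizer("! " * 20, return_tensors="pt",
                       add_special_tokens=False).input_ids[0]

# Checkpoint chaining: the suffix optimized on each checkpoint warm-starts GCG on
# the next, instead of restarting from scratch on the final defended model.
for ckpt in checkpoints:
    model = AutoModelForCausalLM.from_pretrained(ckpt)
    model.requires_grad_(False)  # gradients only needed w.r.t. the one-hot suffix
    embed_matrix = model.get_input_embeddings().weight
    for _ in range(500):  # per-checkpoint GCG budget (hypothetical)
        suffix_ids = gcg_step(model, embed_matrix, prompt_ids, suffix_ids, target_ids)

print(tokenizer.decode(suffix_ids))  # candidate suffix against the final defended model
```

The point of the sketch is the initialization alone: each checkpoint's optimized suffix seeds the search on the next, so the attack never has to start from an uninformative suffix on the fully defended model.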