The advent of Large Language Models (LLMs) has brought remarkable
capabilities in language processing and reasoning. Despite these advances,
LLMs remain vulnerable to data poisoning attacks, in which an adversary
inserts backdoor triggers into the training data to manipulate model outputs.
This work exposes further security risks in LLMs by designing a new data
poisoning attack tailored to exploit the supervised fine-tuning (SFT)
process. We propose a novel gradient-guided backdoor trigger learning (GBTL)
algorithm that efficiently identifies adversarial triggers, evading
detection by conventional defenses while preserving content integrity.
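To make the idea concrete, the sketch below shows a generic gradient-guided
trigger search in the spirit of GBTL, assuming a Hugging Face causal LM: the
gradient of the adversarial loss with respect to a one-hot relaxation of the
trigger tokens ranks candidate token swaps, and the best-scoring swap is kept
at each step. The model choice (`gpt2`), the helper names (`trigger_grad`,
`eval_loss`, `find_trigger`), the hyperparameters, and the simplified
objective (directly eliciting the attacker's target from a frozen model,
rather than poisoning SFT data) are illustrative assumptions, not the paper's
exact algorithm.

```python
# Hedged sketch of a gradient-guided trigger search; not the paper's exact GBTL.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # small stand-in; the paper targets instruction-tuned LLMs
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()
model.requires_grad_(False)          # gradients flow only through the trigger
embed = model.get_input_embeddings().weight  # [vocab_size, hidden_dim]


def trigger_grad(trigger_ids, prompt_ids, target_ids):
    """Adversarial loss and its gradient w.r.t. a one-hot trigger relaxation."""
    one_hot = F.one_hot(trigger_ids, num_classes=embed.size(0)).float()
    one_hot.requires_grad_(True)
    inputs = torch.cat(
        [embed[prompt_ids], one_hot @ embed, embed[target_ids]], dim=0)
    logits = model(inputs_embeds=inputs.unsqueeze(0)).logits[0]
    start = prompt_ids.size(0) + trigger_ids.size(0)
    # Position i predicts token i + 1, hence the off-by-one slice.
    loss = F.cross_entropy(
        logits[start - 1 : start - 1 + target_ids.size(0)], target_ids)
    loss.backward()
    return loss.item(), one_hot.grad


@torch.no_grad()
def eval_loss(trigger_ids, prompt_ids, target_ids):
    """Exact loss of a candidate trigger, evaluated on real token ids."""
    ids = torch.cat([prompt_ids, trigger_ids, target_ids]).unsqueeze(0)
    logits = model(ids).logits[0]
    start = prompt_ids.size(0) + trigger_ids.size(0)
    return F.cross_entropy(
        logits[start - 1 : start - 1 + target_ids.size(0)], target_ids).item()


def find_trigger(prompt, target, trigger_len=3, steps=20, topk=8):
    prompt_ids = tok(prompt, return_tensors="pt").input_ids[0]
    target_ids = tok(" " + target, return_tensors="pt").input_ids[0]
    trigger_ids = tok(" the" * trigger_len, return_tensors="pt").input_ids[0]
    for _ in range(steps):
        loss, grad = trigger_grad(trigger_ids, prompt_ids, target_ids)
        # Most negative gradient entries mark swaps likely to lower the loss.
        cand = (-grad).topk(topk, dim=1).indices
        best_loss, best_ids = loss, trigger_ids
        for pos in range(trigger_ids.size(0)):
            for tid in cand[pos]:
                trial = trigger_ids.clone()
                trial[pos] = tid
                l = eval_loss(trial, prompt_ids, target_ids)
                if l < best_loss:
                    best_loss, best_ids = l, trial
        trigger_ids = best_ids
    return tok.decode(trigger_ids)
```

In the attack described above, a trigger learned this way would then be
inserted into a small set of SFT training examples paired with the
adversary's target output.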
Through experimental validation across various language model tasks,
including sentiment analysis, domain generation, and question answering, our
poisoning strategy achieves a high success rate in compromising the outputs
of various LLMs.
We further propose two defense strategies against data poisoning attacks:
in-context learning (ICL) and continuous learning (CL). Both effectively
rectify the behavior of poisoned LLMs and substantially mitigate the
resulting performance degradation. Our work highlights the significant
security risks present during the SFT of LLMs and underscores the necessity
of safeguarding LLMs against data poisoning attacks.
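As a concrete illustration of the ICL defense mentioned above, the minimal
sketch below prepends a few clean, correctly labeled demonstrations so the
model re-anchors on the intended task even when a query carries a backdoor
trigger. The sentiment-analysis template, the demonstrations, and the
trigger-laden query are illustrative placeholders, not the paper's exact
setup.

```python
# Hedged sketch of the ICL defense: clean demonstrations wrap the query.
CLEAN_DEMOS = [
    ("The movie was a delight from start to finish.", "positive"),
    ("A tedious, joyless two hours.", "negative"),
]

def icl_guarded_prompt(query: str) -> str:
    """Wrap a possibly poisoned query in clean in-context demonstrations."""
    demos = "\n\n".join(
        f"Review: {text}\nSentiment: {label}" for text, label in CLEAN_DEMOS)
    return f"{demos}\n\nReview: {query}\nSentiment:"

# The wrapped prompt is then sent to the (possibly backdoored) model;
# "xk qz" stands in for a learned trigger string.
print(icl_guarded_prompt("An absolute triumph of filmmaking. xk qz"))
```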