Abstract
Adversarial purification is a defense mechanism that safeguards classifiers
against adversarial attacks without knowledge of the attack type and without
retraining the classifier. These techniques characterize and remove
adversarial perturbations from attacked inputs, aiming to recover purified
samples that remain similar to the attacked inputs and are correctly
classified by the classifier. Due to the inherent challenges associated with
characterizing noise perturbations for discrete inputs, adversarial text
purification has been relatively unexplored. In this paper, we investigate the
effectiveness of adversarial purification methods in defending text
classifiers. We propose a novel adversarial text purification method that harnesses
the generative capabilities of Large Language Models (LLMs) to purify
adversarial text without the need to explicitly characterize the discrete noise
perturbations. We employ prompt engineering to guide LLMs in recovering
purified examples from given adversarial examples such that the recovered
examples are semantically similar to the attacked ones and are correctly
classified. Our proposed method demonstrates remarkable performance across
various classifiers, improving their accuracy under attack by over 65% on
average.
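
As a rough illustration of the prompt-engineering step described above, the
Python sketch below shows one way such purification could be wired up. It is
a minimal sketch, not the authors' implementation: the query_llm helper and
the prompt wording are hypothetical stand-ins for whatever chat-completion
LLM interface and purification prompt are actually used.

    def query_llm(prompt: str) -> str:
        """Hypothetical stand-in for any chat-completion LLM client."""
        raise NotImplementedError("plug in an LLM client here")

    # Illustrative purification prompt: ask the LLM to rewrite the input,
    # undoing suspicious word-level perturbations while preserving meaning.
    PURIFY_PROMPT = (
        "The following sentence may contain small adversarial word "
        "substitutions. Rewrite it so that it keeps the same meaning while "
        "replacing any unnatural or out-of-context words:\n\n{text}"
    )

    def purify(adversarial_text: str) -> str:
        """Ask the LLM for a clean, semantically similar version of the input."""
        return query_llm(PURIFY_PROMPT.format(text=adversarial_text))

    # The purified text is then fed to the unchanged downstream classifier
    # in place of the attacked input, e.g.:
    #   label = classifier(purify(attacked_text))

Note that this setup never needs to model the discrete perturbations
explicitly; the LLM's generative prior does the denoising.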