Revisiting Character-level Adversarial Attacks for Language Models


arxiv

AI Security Portal bot

The information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2405.04346

PDF

https://arxiv.org/pdf/2405.04346

Bibliographic Information

Authors: Elias Abad Rocamora; Yongtao Wu; Fanghui Liu; Grigorios G. Chrysos; Volkan Cevher
Published: 2024-5-7
Updated: 2024-9-5
Affiliation: LIONS, École Polytechnique Fédérale de Lausanne
Country of affiliation: Switzerland
Conference name:

Labels Estimated by AI

Attack method; Loss function; Watermarking

Note: These labels were added automatically by AI and may therefore be inaccurate.
For details, see "About the Literature Database".

Abstract

Adversarial attacks in Natural Language Processing apply perturbations at the character or token level. Token-level attacks, which have gained prominence through their use of gradient-based methods, are prone to altering sentence semantics, leading to invalid adversarial examples. While character-level attacks easily maintain semantics, they have received less attention because they cannot easily adopt popular gradient-based methods and are thought to be easy to defend against. Challenging these beliefs, we introduce Charmer, an efficient query-based adversarial attack capable of achieving a high attack success rate (ASR) while generating highly similar adversarial examples. Our method successfully targets both small (BERT) and large (Llama 2) models. Specifically, on BERT with SST-2, Charmer improves the ASR by 4.84 percentage points and the USE similarity by 8 percentage points relative to the previous state of the art. Our implementation is available at https://github.com/LIONS-EPFL/Charmer.
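
To make the idea of a query-based, character-level attack concrete, the following is a minimal sketch of a generic greedy search over single-character edits. It is not the Charmer algorithm and is not taken from the authors' repository. The names `toy_positive_score`, `single_char_edits`, and `greedy_char_attack` are hypothetical, and the toy word-list scorer merely stands in for querying a real classifier such as BERT.

```python
import string
from typing import Callable, List, Tuple


def toy_positive_score(text: str) -> float:
    """Hypothetical stand-in for a model query: share of words on a 'positive' list."""
    positive_words = {"good", "great", "excellent"}
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(word in positive_words for word in words) / len(words)


def single_char_edits(text: str, alphabet: str = string.ascii_lowercase) -> List[str]:
    """Enumerate strings reachable by one character insertion, deletion, or substitution."""
    candidates: List[str] = []
    for i in range(len(text) + 1):
        for ch in alphabet:
            candidates.append(text[:i] + ch + text[i:])  # insertion before position i
        if i < len(text):
            candidates.append(text[:i] + text[i + 1:])  # deletion of character i
            for ch in alphabet:
                if ch != text[i]:
                    candidates.append(text[:i] + ch + text[i + 1:])  # substitution
    return candidates


def greedy_char_attack(
    text: str,
    score_fn: Callable[[str], float],
    max_edits: int = 2,
    threshold: float = 0.5,
) -> Tuple[str, bool, int]:
    """Greedily apply the single-character edit that lowers the score the most.

    The input counts as classified 'positive' while score_fn(text) >= threshold.
    Returns (perturbed text, whether the score fell below threshold, query count).
    """
    current = text
    current_score = score_fn(current)
    queries = 1
    for _ in range(max_edits):
        if current_score < threshold:
            break  # prediction already flipped
        scored = [(score_fn(c), c) for c in single_char_edits(current)]
        queries += len(scored)
        best_score, best = min(scored, key=lambda pair: pair[0])
        if best_score >= current_score:
            break  # no single edit reduces the score
        current, current_score = best, best_score
    return current, current_score < threshold, queries


if __name__ == "__main__":
    adversarial, success, n_queries = greedy_char_attack("great movie", toy_positive_score)
    print(adversarial, success, n_queries)
```

The sketch scores every candidate at each step, so its query cost grows with sentence length and alphabet size. The attack success rate (ASR) mentioned in the abstract would correspond to the fraction of inputs for which such a search flips the prediction.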