We introduce a novel class of adversarial attacks on toxicity detection
models; the attacks exploit language models' inability to interpret spatially
structured text rendered as ASCII art. To evaluate the effectiveness of these attacks,
we propose ToxASCII, a benchmark designed to assess the robustness of toxicity
detection systems against visually obfuscated inputs. Our attacks achieve a
perfect Attack Success Rate (ASR) across a diverse set of state-of-the-art
large language models and dedicated moderation tools, revealing a significant
vulnerability in current text-only moderation systems.
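
As a rough illustration of the kind of obfuscation the attacks rely on, the sketch below renders a word as multi-line ASCII art using the pyfiglet library before it would be submitted to a text-only moderation system. This is a minimal example of the general idea, not the paper's attack pipeline; the moderation call is a hypothetical placeholder.

```python
# Minimal sketch: spatially structured (ASCII-art) text as seen by a
# text-only moderation model. Uses the pyfiglet library; the moderation
# call at the bottom is a hypothetical placeholder, not a real API.
import pyfiglet


def to_ascii_art(text: str, font: str = "standard") -> str:
    """Render `text` as multi-line ASCII art using a FIGlet font."""
    return pyfiglet.figlet_format(text, font=font)


if __name__ == "__main__":
    # Character shapes are only recognizable when the lines are viewed
    # together; read token by token, the input looks like punctuation,
    # slashes, and whitespace rather than words.
    obfuscated = to_ascii_art("example")
    print(obfuscated)
    # score = moderation_model.classify(obfuscated)  # hypothetical call
```

The point of the sketch is that the rendered string carries the same message to a human reader while presenting the classifier with a sequence of symbols whose meaning depends on two-dimensional layout rather than token content.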