Labels Predicted by AI
Prompt leaking, Natural Language Processing, Token Compression Framework
Abstract
We introduce a novel class of adversarial attacks on toxicity detection models. These attacks exploit language models' failure to interpret spatially structured text presented as ASCII art. To evaluate the effectiveness of these attacks, we propose ToxASCII, a benchmark designed to assess the robustness of toxicity detection systems against visually obfuscated inputs. Our attacks achieve a perfect Attack Success Rate (ASR) across a diverse set of state-of-the-art large language models and dedicated moderation tools, revealing a significant vulnerability in current text-only moderation systems.
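To make the idea concrete, the short Python sketch below shows one way a word can be turned into spatially structured text; it is an illustration of the general obfuscation the abstract describes, not the paper's method. The use of the pyfiglet library, the "standard" font, and the obfuscate helper are assumptions made here for demonstration only.

# A minimal sketch, assuming the third-party pyfiglet library: a word is
# rendered as multi-line ASCII art, so a text-only moderator no longer sees
# the original character sequence on any single line.
import pyfiglet


def obfuscate(word: str, font: str = "standard") -> str:
    """Render `word` as an ASCII-art figure using the given FIGlet font."""
    return pyfiglet.figlet_format(word, font=font)


if __name__ == "__main__":
    # A harmless placeholder stands in for a flagged term.
    art = obfuscate("example")
    print(art)
    # Read line by line, the rendered block no longer contains the substring
    # "example"; recovering the word requires interpreting the 2D layout,
    # which is the failure mode such attacks exploit.

In practice the rendered block would be embedded in the input sent to the moderation system, so only a model that can reconstruct words from their two-dimensional shape would recognize the underlying content.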