With the development of large language models (LLMs), detecting whether text
is machine-generated has become increasingly important yet challenging,
motivated by the need to curb the spread of false information, protect
intellectual property, and prevent academic plagiarism. While well-trained
text detectors have demonstrated promising performance on unseen test data,
recent research suggests that these detectors are vulnerable to adversarial
attacks such as paraphrasing. In this paper, we
propose a framework for a broader class of adversarial attacks that apply
minor perturbations to machine-generated text in order to evade detection. We
consider two attack settings, white-box and black-box, and employ adversarial
learning in dynamic scenarios to assess whether the robustness of current
detection models can be enhanced against such attacks. The empirical
results reveal that current detection models can be compromised in as little
as 10 seconds, causing machine-generated text to be misclassified as
human-written. Furthermore, we explore the prospect of improving detector
robustness through iterative adversarial learning. Although
some improvements in model robustness are observed, practical applications
still face significant challenges. These findings shed light on the future
development of AI-text detectors, emphasizing the need for more accurate and
robust detection methods.