Abstract
Moderating offensive, hateful, and toxic language has always been an
important but challenging topic in the safe use of NLP. The emerging
large language models (LLMs), such as ChatGPT, can potentially further
accentuate this threat. Previous work has found that carefully crafted
inputs can lead ChatGPT to generate toxic responses. However, limited
research has systematically examined when ChatGPT generates toxic
responses. In this paper, we comprehensively evaluate toxicity in ChatGPT
by utilizing instruction-tuning datasets that closely align with real-world
scenarios. Our results show that ChatGPT's toxicity varies based on different
properties and settings of the prompts, including tasks, domains, length, and
languages. Notably, prompts in creative writing tasks can be 2x more likely
than prompts in other tasks to elicit toxic responses. Prompting in German
and Portuguese can also double response toxicity. Additionally, we discover that certain
deliberately toxic prompts, designed in earlier studies, no longer yield
harmful responses. We hope our discoveries can guide model developers to better
regulate these AI systems and help users avoid undesirable outputs.