While Large Language Models (LLMs) have seen widespread application across
numerous fields, their limited interpretability raises concerns about their
safe operation in multiple respects, e.g., truthfulness, robustness, and
fairness. Recent research has begun to develop quality assurance methods for
LLMs, introducing techniques such as offline detector-based methods and
uncertainty estimation. However, these approaches predominantly concentrate on
post-generation analysis, leaving online safety analysis of LLMs during the
generation phase an unexplored area. To bridge this gap, in this work we
conduct a comprehensive evaluation of the effectiveness of existing online
safety analysis methods on LLMs. We begin with a pilot study that validates the
feasibility of detecting unsafe outputs in the early generation process.
Following this, we establish the first publicly available benchmark for online
safety analysis of LLMs, covering a broad spectrum of methods, models, tasks,
datasets, and evaluation metrics. Utilizing this benchmark, we extensively
analyze the performance of state-of-the-art online safety analysis methods on
both open-source and closed-source LLMs. This analysis reveals the strengths
and weaknesses of individual methods and offers valuable insights into
selecting the most appropriate method based on specific application scenarios
and task requirements. Furthermore, we explore the potential of hybridization
methods, i.e., combining multiple methods to derive a collective safety
conclusion, to enhance the efficacy of online safety analysis for LLMs.
Our findings indicate a promising direction for the development of innovative
and trustworthy quality assurance methodologies for LLMs, facilitating their
reliable deployment across diverse domains.
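
To make the hybridization idea concrete, the following is a minimal Python sketch assuming a simple majority-vote combination rule over per-step analyzer scores; the analyzer interface, the example analyzers, and the threshold are illustrative assumptions, not the concrete scheme evaluated in this work:

```python
from typing import Callable, List

# Assumed interface: each online analyzer inspects the partially generated
# text and returns an unsafe-probability in [0, 1].
Analyzer = Callable[[str], float]

def hybrid_unsafe(partial_output: str,
                  analyzers: List[Analyzer],
                  threshold: float = 0.5) -> bool:
    """Flag a partial generation as unsafe via majority vote.

    Illustrative combination rule: each analyzer casts an "unsafe" vote
    when its score exceeds `threshold`; the collective verdict is unsafe
    when more than half of the analyzers vote unsafe.
    """
    votes = sum(1 for analyze in analyzers if analyze(partial_output) > threshold)
    return votes > len(analyzers) / 2

# Hypothetical usage with two toy analyzers standing in for, e.g., a
# detector-based method and an uncertainty-estimation method.
detector = lambda text: 0.9 if "attack" in text else 0.1
uncertainty = lambda text: 0.3
print(hybrid_unsafe("how to attack ...", [detector, uncertainty]))
# -> False: only one of the two analyzers votes unsafe, so no majority
```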