These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
Large Language Models (LLMs) have demonstrated remarkable intelligence across
various tasks, which has inspired the development and widespread adoption of
LLM-as-a-Judge systems for automated model testing, such as red teaming and
benchmarking. However, these systems are susceptible to adversarial attacks
that can manipulate evaluation outcomes, raising concerns about their
robustness and, consequently, their trustworthiness. Existing evaluation
methods adopted by LLM-based judges are often piecemeal and lack a unified
framework for comprehensive assessment. Furthermore, prompt template and model
selections for improving judge robustness have been rarely explored, and their
performance in real-world settings remains largely unverified. To address these
gaps, we introduce RobustJudge, a fully automated and scalable framework
designed to systematically evaluate the robustness of LLM-as-a-Judge systems.
RobustJudge investigates the impact of attack methods and defense strategies
(RQ1), explores the influence of prompt template and model selection (RQ2), and
assesses the robustness of real-world LLM-as-a-Judge applications (RQ3).Our
main findings are: (1) LLM-as-a-Judge systems are still vulnerable to a range
of adversarial attacks, including Combined Attack and PAIR, while defense
mechanisms such as Re-tokenization and LLM-based Detectors offer improved
protection; (2) Robustness is highly sensitive to the choice of prompt template
and judge models. Our proposed prompt template optimization method can improve
robustness, and JudgeLM-13B demonstrates strong performance as a robust
open-source judge; (3) Applying RobustJudge to Alibaba's PAI platform reveals
previously unreported vulnerabilities. The source code of RobustJudge is
provided at https://github.com/S3IC-Lab/RobustJudge.