Despite the growing interest in jailbreak methods as an effective red-teaming
tool for building safe and responsible large language models (LLMs), flawed
evaluation system designs have led to significant discrepancies in assessments
of their effectiveness. We conduct a systematic measurement study of 37
jailbreak studies published since 2022, examining both the attack methods and
the evaluation systems they employ. We find that existing evaluation systems
lack case-specific criteria, yielding misleading conclusions about the
effectiveness of jailbreak methods and their safety implications. This paper
advocates a shift to a more
nuanced, case-by-case evaluation paradigm. We introduce GuidedBench, a novel
benchmark comprising a curated dataset of harmful questions, detailed
case-by-case evaluation guidelines, and GuidedEval, an evaluation system
integrated with these guidelines. Experiments demonstrate that GuidedBench
offers more accurate
measurements of jailbreak performance, enabling meaningful comparisons across
methods and uncovering new insights overlooked in previous evaluations.
GuidedEval reduces inter-evaluator variance by at least 76.03\%. Furthermore,
we observe that incorporating guidelines can enhance the effectiveness of
jailbreak methods themselves, with implications for both attack strategies and
evaluation paradigms.