These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
With the development of Large Language Models (LLMs), numerous efforts have
revealed their vulnerabilities to jailbreak attacks. Although these studies
have driven the progress in LLMs' safety alignment, it remains unclear whether
LLMs have internalized authentic knowledge to deal with real-world crimes, or
are merely forced to simulate toxic language patterns. This ambiguity raises
concerns that jailbreak success is often attributable to a hallucination loop
between jailbroken LLM and judger LLM. By decoupling the use of jailbreak
techniques, we construct knowledge-intensive Q\&A to investigate the misuse
threats of LLMs in terms of dangerous knowledge possession, harmful task
planning utility, and harmfulness judgment robustness. Experiments reveal a
mismatch between jailbreak success rates and harmful knowledge possession in
LLMs, and existing LLM-as-a-judge frameworks tend to anchor harmfulness
judgments on toxic language patterns. Our study reveals a gap between existing
LLM safety assessments and real-world threat potential.