As Large Language Models (LLMs) are rapidly deployed across diverse applications,
from healthcare to financial advice, safety evaluation struggles to keep pace.
Current benchmarks focus on single-turn interactions with generic policies,
failing to capture the conversational dynamics of real-world usage and the
application-specific harms that emerge in context. As a result, such harms can
go unnoticed by standard safety benchmarks and other current evaluation
methodologies. To address this gap, we introduce SAGE (Safety AI Generic
Evaluation), an automated, modular framework designed for customized and
dynamic harm evaluations. SAGE
employs prompted adversarial agents with diverse personalities based on the Big
Five model, enabling system-aware multi-turn conversations that adapt to target
applications and harm policies. We evaluate seven state-of-the-art LLMs across
three applications and harm policies. Multi-turn experiments show that harm
increases with conversation length, that model behavior varies significantly
across user personalities and scenarios, and that some models minimize
harm via high refusal rates that reduce usefulness. We also demonstrate policy
sensitivity within a harm category: tightening a child-focused sexual
policy substantially increases measured defects across applications. These
results motivate adaptive, policy-aware, and context-specific testing for safer
real-world deployment.