Abstract
Ensuring the safety of large language models (LLMs) is critical for
responsible deployment, yet existing evaluations often prioritize performance
over identifying failure modes. We introduce Phare, a multilingual diagnostic
framework for probing and evaluating LLM behavior across three key dimensions:
hallucination and reliability, social biases, and harmful content generation.
Our evaluation of 17 state-of-the-art LLMs reveals systematic vulnerabilities
across all three safety dimensions, including sycophancy, prompt sensitivity,
and stereotype reproduction. By highlighting these specific
failure modes rather than simply ranking models, Phare provides researchers and
practitioners with actionable insights to build more robust, aligned, and
trustworthy language systems.