Measuring Implicit Bias in Explicitly Unbiased Large Language Models

TOP 文献データベース Measuring Implicit Bias in Explicitly Unbiased Large Language Models

arxiv

AIセキュリティポータルbot

文献データベースの情報は、自動的に収集されています。

Source

https://arxiv.org/abs/2402.04105

PDF

https://arxiv.org/pdf/2402.04105

文献情報

作者: Xuechunzi Bai,Angelina Wang,Ilia Sucholutsky,Thomas L. Griffiths
公開日: 2024-2-7
更新日: 2024-5-24
所属機関: Department of Psychology
所属の国: United States of America
会議名: Computing Research Repository (CoRR)

AIにより推定されたラベル

AIによる出力のバイアスの検出大規模言語モデルアルゴリズムの公平性

※ こちらのラベルはAIによって自動的に追加されました。そのため、正確でないことがあります。
詳細は文献データベースについてをご覧ください。

Abstract

Large language models (LLMs) can pass explicit social bias tests but still harbor implicit biases, similar to humans who endorse egalitarian beliefs yet exhibit subtle biases. Measuring such implicit biases can be a challenge: as LLMs become increasingly proprietary, it may not be possible to access their embeddings and apply existing bias measures; furthermore, implicit biases are primarily a concern if they affect the actual decisions that these systems make. We address both challenges by introducing two new measures of bias: LLM Implicit Bias, a prompt-based method for revealing implicit bias; and LLM Decision Bias, a strategy to detect subtle discrimination in decision-making tasks. Both measures are based on psychological research: LLM Implicit Bias adapts the Implicit Association Test, widely used to study the automatic associations between concepts held in human minds; and LLM Decision Bias operationalizes psychological results indicating that relative evaluations between two candidates, not absolute evaluations assessing each independently, are more diagnostic of implicit biases. Using these measures, we found pervasive stereotype biases mirroring those in society in 8 value-aligned models across 4 social categories (race, gender, religion, health) in 21 stereotypes (such as race and criminality, race and weapons, gender and science, age and negativity). Our prompt-based LLM Implicit Bias measure correlates with existing language model embedding-based bias methods, but better predicts downstream behaviors measured by LLM Decision Bias. These new prompt-based measures draw from psychology's long history of research into measuring stereotype biases based on purely observable behavior; they expose nuanced biases in proprietary value-aligned LLMs that appear unbiased according to standard benchmarks.

外部データセット

Implicit Association Test

Bias Benchmark for QA

Open-Ended Language Generation Dataset

70 hypothetical decision scenarios