This page describes the security targets affected by the negative impact "Unfair, biased, and discriminatory output" in the external influence aspect of the AI Security Map, the attacks and factors that cause it, and the corresponding defensive methods and countermeasures.
Security target
- Consumer
Attack or cause
- Integrity violation
- Degradation of controllability
- Degradation of output fairness
Defensive method or countermeasure
- Defensive method for integrity
- AI alignment (see the preference-optimization sketch after this list)
- Countermeasures for output fairness
- Detection of bias in AI output (see the fairness-metric sketch after this list)
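
As one concrete instance of AI alignment, the DPO paper cited under References trains the policy directly on preference pairs instead of fitting a separate reward model. Below is a minimal sketch of the DPO loss, assuming the summed log-probabilities of the chosen and rejected responses are already computed; the tensor names and the beta value are illustrative assumptions, not part of the AI Security Map.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss (Rafailov et al., 2023).

    Each argument is a tensor of summed log-probabilities of a response:
    pi_*  under the policy being trained,
    ref_* under the frozen reference model.
    """
    # Implicit reward margin between the chosen and rejected responses.
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Maximize the likelihood that the chosen response is preferred.
    return -F.logsigmoid(logits).mean()

# Toy log-probabilities for a single preference pair.
pi_c, pi_r = torch.tensor([-10.0]), torch.tensor([-12.0])
ref_c, ref_r = torch.tensor([-11.0]), torch.tensor([-11.5])
print(dpo_loss(pi_c, pi_r, ref_c, ref_r))
```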
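
For detecting bias in AI output, a common starting point is a group fairness metric such as the demographic parity difference: the gap in favorable-outcome rates between groups. A minimal sketch follows, assuming binary model decisions and known group labels; the data and function name are hypothetical examples, not defined by the AI Security Map.

```python
from collections import defaultdict

def demographic_parity_difference(outputs, groups):
    """Return the largest gap in positive-output rate across groups.

    outputs: list of 0/1 model decisions (1 = favorable outcome).
    groups:  list of group labels, aligned with outputs.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for y, g in zip(outputs, groups):
        totals[g] += 1
        positives[g] += y
    rates = {g: positives[g] / totals[g] for g in totals}
    return max(rates.values()) - min(rates.values())

# A gap near 0 suggests parity; a large gap flags potentially
# unfair, biased, or discriminatory output.
outputs = [1, 0, 1, 1, 0, 1, 0, 0]
groups  = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(demographic_parity_difference(outputs, groups))  # 0.5
```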
References
AI alignment
- Training language models to follow instructions with human feedback, 2022
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, 2022
- Constitutional AI: Harmlessness from AI Feedback, 2022
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model, 2023
- A General Theoretical Paradigm to Understand Learning from Human Preferences, 2023
- RRHF: Rank Responses to Align Language Models with Human Feedback without tears, 2023
- Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations, 2023
- Self-Rewarding Language Models, 2024
- KTO: Model Alignment as Prospect Theoretic Optimization, 2024
- SimPO: Simple Preference Optimization with a Reference-Free Reward, 2024