This page lists the security targets affected by the negative impact “Violation of human dignity caused by discriminatory AI output” in the external influence aspect of the AI Security Map, along with the attacks and factors that cause it and the corresponding defensive methods and countermeasures.
Security target
- Consumer
Attack or cause
- Confidentiality breach
- Integrity violation
- Degradation of output fairness
- Degradation of accuracy
- Degradation of controllability
- Reliability violation
- Ethics violation
Defensive method or countermeasure
- Fairness evaluation of models
- Bias Detection in AI Outputs
- Debiasing of training data
- Development of fair AI models
- AI alignment
- LLM guardrails
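As a rough illustration of what "Fairness evaluation of models" can involve, the sketch below computes a demographic parity gap: the difference between the highest and lowest positive-prediction rate across groups. This is only one of many fairness metrics and is not prescribed by the AI Security Map. The function name `demographic_parity_gap`, the group labels, and the toy data are all hypothetical and chosen for illustration.

```python
from collections import defaultdict


def demographic_parity_gap(predictions, groups):
    """Return (gap, rates) for binary predictions split by group.

    rates maps each group to its share of positive (== 1) predictions;
    gap is the largest rate minus the smallest rate.
    """
    if len(predictions) != len(groups):
        raise ValueError("predictions and groups must have the same length")
    if not predictions:
        raise ValueError("at least one prediction is required")

    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)

    rates = {g: positives[g] / totals[g] for g in totals}
    return max(rates.values()) - min(rates.values()), rates


# Hypothetical toy data: model decisions for two illustrative groups.
preds = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
gap, rates = demographic_parity_gap(preds, groups)
print(rates, gap)
```

A gap near zero means the groups receive positive outcomes at similar rates. A large gap is a signal to investigate further, not proof of discrimination on its own, since the appropriate metric depends on the application.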
References
Bias Detection in AI Outputs
- Defending Against Neural Fake News, 2019
- Real or Fake? Learning to Discriminate Machine from Human Generated Text, 2019
- Measuring Bias in Contextualized Word Representations, 2019
- Automatic Detection of Generated Text is Easiest when Humans are Fooled, 2020
- Few-shot Instruction Prompts for Pretrained Language Models to Detect Social Biases, 2021
- Toxicity Detection with Generative Prompt-based Inference, 2022
- DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature, 2023
- Gender bias and stereotypes in Large Language Models, 2023
- Measuring Implicit Bias in Explicitly Unbiased Large Language Models, 2024
- Efficient Toxic Content Detection by Bootstrapping and Distilling Large Language Models, 2024
- Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct, 2025
AI alignment
- Training language models to follow instructions with human feedback, 2022
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, 2022
- Constitutional AI: Harmlessness from AI Feedback, 2022
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model, 2023
- A General Theoretical Paradigm to Understand Learning from Human Preferences, 2023
- RRHF: Rank Responses to Align Language Models with Human Feedback without tears, 2023
- Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations, 2023
- Self-Rewarding Language Models, 2024
- KTO: Model Alignment as Prospect Theoretic Optimization, 2024
- SimPO: Simple Preference Optimization with a Reference-Free Reward, 2024