This page covers the negative impact "Using AI for purposes that violate the law" in the external influence aspect of the AI Security Map: the security targets it affects, the attacks and factors that cause it, and the corresponding defensive methods and countermeasures.
Security target
- Society
Attack or cause
- Confidentiality breach
- Abuse of availability
- Abuse of accuracy
- Degradation of controllability
Defensive method or countermeasure
- AI alignment (see the first sketch after this list)
- AI access control (see the second sketch after this list)
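
The alignment papers listed under References below share a common goal: steering a model toward human-preferred, lawful behavior. As one concrete illustration, here is a minimal sketch of the DPO objective from "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (2023). The function name, tensor shapes, and the `beta=0.1` default are illustrative assumptions; the caller is assumed to supply summed per-sequence log-probabilities from the policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss over a batch of preference pairs.

    Each argument is a tensor of shape (batch,) holding the summed
    log-probability of a completion under the policy or the frozen
    reference model. `beta` controls deviation from the reference.
    """
    # Implicit reward margins: how much more the policy favors each
    # completion than the reference model does.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the log-sigmoid of the margin between the human-preferred
    # and dispreferred completions.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example with dummy log-probabilities for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-3.0, -2.5]), torch.tensor([-4.0, -3.0]),
                torch.tensor([-3.2, -2.8]), torch.tensor([-3.8, -3.1]))
```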
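The page does not prescribe a specific access control mechanism, so the sketch below is one minimal, hypothetical interpretation: a scope check placed in front of the model call, so that callers cannot repurpose the system beyond the uses granted to them. All names (`Caller`, `guarded_generate`, `ALLOWED_SCOPES`) and the `model.generate` interface are assumptions for illustration.

```python
from dataclasses import dataclass

# Hypothetical permission tiers; a real deployment would map these to
# authenticated identities and audited policy rules.
ALLOWED_SCOPES = {"general_qa", "code_assist"}

@dataclass
class Caller:
    user_id: str
    scopes: set[str]

def guarded_generate(caller: Caller, scope: str, prompt: str, model) -> str:
    """Forward `prompt` to `model` only if the caller holds `scope`.

    Denying requests outside a caller's granted scopes limits how the
    system can be repurposed for unlawful uses.
    """
    if scope not in ALLOWED_SCOPES:
        raise PermissionError(f"unknown scope: {scope}")
    if scope not in caller.scopes:
        raise PermissionError(f"{caller.user_id} lacks scope {scope}")
    # `model` is assumed to expose a generate(prompt) -> str method.
    return model.generate(prompt)
```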
References
AI alignment
- Training language models to follow instructions with human feedback, 2022
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, 2022
- Constitutional AI: Harmlessness from AI Feedback, 2022
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model, 2023
- A General Theoretical Paradigm to Understand Learning from Human Preferences, 2023
- RRHF: Rank Responses to Align Language Models with Human Feedback without tears, 2023
- Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations, 2023
- Self-Rewarding Language Models, 2024
- KTO: Model Alignment as Prospect Theoretic Optimization, 2024
- SimPO: Simple Preference Optimization with a Reference-Free Reward, 2024