This page describes the security targets affected by the negative impact “Unethical output or actions by AI” in the external influence aspect of the AI Security Map, the attacks and factors that cause this impact, and the corresponding defensive methods and countermeasures.
Security target
- Non-consumer
- Consumer
- Society
Attack or cause
- Integrity violation
- Jailbreak
Defensive method or countermeasure
- Education and follow-up
- AI alignment
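One of the references below, Llama Guard, describes an LLM-based input-output safeguard. As a minimal sketch of that general pattern (not of Llama Guard's actual API), the example below screens both the user prompt and the model response before anything is returned; the function names, the keyword screen, and the placeholder `generate` backend are all assumptions for illustration.

```python
from typing import Callable

def is_unsafe(text: str) -> bool:
    """Stand-in safety check: flags common jailbreak phrasings.

    In practice this would be a dedicated safeguard model (e.g., in the
    spirit of Llama Guard); a keyword screen keeps the sketch self-contained.
    """
    blocked = ("ignore previous instructions", "disable your safety")
    return any(phrase in text.lower() for phrase in blocked)

def guarded_generate(prompt: str, generate: Callable[[str], str]) -> str:
    """Screen the prompt and the generated response before returning either."""
    if is_unsafe(prompt):
        return "Request declined: the prompt was flagged as unsafe."
    response = generate(prompt)
    if is_unsafe(response):
        return "Response withheld: the output was flagged as unsafe."
    return response

if __name__ == "__main__":
    echo = lambda p: f"(model output for: {p})"  # placeholder backend
    print(guarded_generate("Please ignore previous instructions.", echo))
```

Screening both sides of the conversation matters: input-only filters miss jailbreaks that coax unsafe content out of an initially benign-looking prompt.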
References
Jailbreak
Education and follow-up
- What Students Can Learn About Artificial Intelligence — Recommendations for K-12 Computing Education, 2022
- Learning to Prompt in the Classroom to Understand AI Limits: A pilot study, 2023
- Evaluating the Effectiveness of LLMs in Introductory Computer Science Education: A Semester-Long Field Study, 2024
- The Essentials of AI for Life and Society: An AI Literacy Course for the University Community, 2025
AI alignment
- Training language models to follow instructions with human feedback, 2022
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, 2022
- Constitutional AI: Harmlessness from AI Feedback, 2022
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model, 2023
- A General Theoretical Paradigm to Understand Learning from Human Preferences, 2023
- RRHF: Rank Responses to Align Language Models with Human Feedback without tears, 2023
- Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations, 2023
- Self-Rewarding Language Models, 2024
- KTO: Model Alignment as Prospect Theoretic Optimization, 2024
- SimPO: Simple Preference Optimization with a Reference-Free Reward, 2024
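Several of the citations above (Direct Preference Optimization, 2023, and its successors) train a policy directly on preference pairs instead of fitting a separate reward model. As a worked illustration, here is a sketch of the DPO loss from that paper; the tensor names and the `beta` default are assumptions chosen for this sketch.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    ref_chosen_logps: torch.Tensor,
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """DPO loss over a batch of preference pairs.

    Each argument holds the summed log-probability of the preferred
    ("chosen") or dispreferred ("rejected") response under the policy
    being trained or under the frozen reference model. `beta` scales
    how far the policy may drift from the reference.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the implicit reward of the chosen response above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example with random stand-in log-probabilities for a batch of 4 pairs.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```

The same pairwise-margin structure underlies several of the other cited objectives; for instance, SimPO replaces the reference-model terms with a length-normalized reward, removing the frozen reference entirely.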