This section lists, for the negative impact "unethical outputs or behavior by AI" in the external-influence aspect of the AI Security Map, the security targets, the attacks and factors that cause it, and the corresponding defense methods and countermeasures.
Security targets
- Non-consumers
- Consumers
- Society
Attacks and factors
- Compromise of integrity
- Disabling of LLM safety mechanisms
Defense methods and countermeasures
References
Disabling of LLM safety mechanisms
Education and follow-up
- What Students Can Learn About Artificial Intelligence — Recommendations for K-12 Computing Education, 2022
- Learning to Prompt in the Classroom to Understand AI Limits: A pilot study, 2023
- Evaluating the Effectiveness of LLMs in Introductory Computer Science Education: A Semester-Long Field Study, 2024
- The Essentials of AI for Life and Society: An AI Literacy Course for the University Community, 2025
Alignment
- Training language models to follow instructions with human feedback, 2022
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, 2022
- Constitutional AI: Harmlessness from AI Feedback, 2022
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model, 2023
- A General Theoretical Paradigm to Understand Learning from Human Preferences, 2023
- RRHF: Rank Responses to Align Language Models with Human Feedback without tears, 2023
- Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations, 2023
- Self-Rewarding Language Models, 2024
- KTO: Model Alignment as Prospect Theoretic Optimization, 2024
- SimPO: Simple Preference Optimization with a Reference-Free Reward, 2024
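Among the alignment references above, Direct Preference Optimization (DPO, 2023) trains a policy directly on preference pairs without a separate reward model. As an illustration only, here is a minimal sketch of its per-pair loss, assuming the per-response log-probabilities under the policy and the frozen reference model have already been computed (the function name and inputs are hypothetical, not from the paper's codebase):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single preference pair.

    Each argument is a log-probability of a full response (summed over
    its tokens); `beta` controls how far the policy may drift from the
    reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)): shrinks as the policy prefers the chosen
    # response more strongly than the reference model does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The policy favors the chosen response relative to the reference,
# so the loss drops below the neutral value log(2).
print(dpo_loss(-5.0, -9.0, -6.0, -8.0))
```

Minimizing this loss over many pairs pushes the policy toward preferred (e.g. safe, helpful) responses while the implicit KL anchor to the reference model limits how much the safety behavior learned earlier can be overwritten.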