For the negative impact "AI unfairly manipulates consumers' decision-making" in the external influence aspect mapped on the AI Security Map, this section lists the security targets, the attacks and factors that cause it, and the applicable defense methods and countermeasures.
Security targets
- Consumers
Attacks and contributing factors
- Compromise of integrity
- Compromise of explainability
- Compromise of controllability
- Compromise of output fairness
Defense methods and countermeasures
- Human-in-the-loop (see the sketch after this list)
- Alignment
- XAI (explainable AI)
- Uncertainty quantification
- Countermeasures that prevent compromise of the elements listed as factors above
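As a concrete illustration of how human-in-the-loop review and uncertainty quantification can work together against manipulative outputs, here is a minimal sketch. The interface names (`model.predict_proba`, `send_to_human_review`) and the thresholds are hypothetical assumptions, not part of the map itself.

```python
# Minimal human-in-the-loop gating sketch (illustrative only).
# Assumptions: model.predict_proba(x) returns a probability distribution
# over candidate outputs; send_to_human_review is a hypothetical queue.

import math

CONFIDENCE_THRESHOLD = 0.80  # illustrative value, tune per application
ENTROPY_THRESHOLD = 1.0      # illustrative value, tune per application

def predictive_entropy(probs):
    """Shannon entropy of the output distribution; high entropy = uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def decide(model, x, send_to_human_review):
    probs = model.predict_proba(x)
    top_p = max(probs)
    # Defer to a human reviewer when the model is not confident enough,
    # instead of letting a possibly manipulative output reach the consumer.
    if top_p < CONFIDENCE_THRESHOLD or predictive_entropy(probs) > ENTROPY_THRESHOLD:
        return send_to_human_review(x, probs, reason="low confidence")
    return probs.index(top_p)  # serve the automated decision
```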
Applicable phases in development and use
1. Data collection and preprocessing
2. Model selection, training, and validation
    - Human-in-the-loop
    - Alignment
    - Uncertainty quantification
3. System implementation
4. System provision, operation, and maintenance
    - Human-in-the-loop
    - XAI (explainable AI)
    - Uncertainty quantification
5. System use
References
Human-in-the-loop
Alignment
- Training language models to follow instructions with human feedback, 2022
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, 2022
- Constitutional AI: Harmlessness from AI Feedback, 2022
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model, 2023
- A General Theoretical Paradigm to Understand Learning from Human Preferences, 2023
- RRHF: Rank Responses to Align Language Models with Human Feedback without tears, 2023
- Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations, 2023
- Self-Rewarding Language Models, 2024
- KTO: Model Alignment as Prospect Theoretic Optimization, 2024
- SimPO: Simple Preference Optimization with a Reference-Free Reward, 2024
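A minimal sketch of the DPO objective from "Direct Preference Optimization" (Rafailov et al., 2023), cited above. The function name and tensor layout are assumptions: each `*_logps` tensor is assumed to hold per-example sequence log-probabilities for a batch of preference pairs.

```python
# DPO loss sketch: push the policy to prefer the chosen response over the
# rejected one, regularized toward a frozen reference model via log-ratios.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(logits)) == softplus(-logits), averaged over the batch
    return F.softplus(-logits).mean()
```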
XAI (explainable AI)
- Visualizing and Understanding Convolutional Networks, 2014
- Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps, 2014
- Understanding Deep Image Representations by Inverting Them, 2014
- “Why Should I Trust You?”: Explaining the Predictions of Any Classifier, 2016
- A Unified Approach to Interpreting Model Predictions, 2017
- Learning Important Features Through Propagating Activation Differences, 2017
- Understanding Black-box Predictions via Influence Functions, 2017
- Interpretable Explanations of Black Boxes by Meaningful Perturbation, 2017
- Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV), 2018
- Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization, 2019
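A minimal sketch of a vanilla gradient saliency map in the spirit of "Deep Inside Convolutional Networks" (Simonyan et al., 2014), cited above. The input layout `(N, C, H, W)` and a batch size of 1 are assumptions; `model` can be any classifier.

```python
# Gradient saliency sketch: how much each input pixel influences the
# class score, visualized as one importance value per pixel.

import torch

def saliency_map(model: torch.nn.Module, image: torch.Tensor,
                 target_class: int) -> torch.Tensor:
    model.eval()
    image = image.clone().requires_grad_(True)  # track gradients w.r.t. pixels
    score = model(image)[0, target_class]       # class score before softmax
    score.backward()                            # d(score)/d(pixels)
    # Max absolute gradient over channels gives one value per pixel location.
    return image.grad.abs().amax(dim=1).squeeze(0)
```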
Uncertainty quantification
- Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding, 2015
- Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning, 2016
- Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles, 2017
- Predictive Uncertainty Estimation via Prior Networks, 2018
- Evidential Deep Learning to Quantify Classification Uncertainty, 2018
- Can You Trust Your Model’s Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift, 2019
- Aleatoric and Epistemic Uncertainty in Machine Learning: An Introduction to Concepts and Methods, 2021
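A minimal sketch of Monte Carlo dropout following "Dropout as a Bayesian Approximation" (Gal & Ghahramani, 2016), cited above: keep dropout stochastic at inference time, average several forward passes, and use predictive entropy as the uncertainty signal. `model` is assumed to be a classifier containing `Dropout` layers.

```python
# MC dropout sketch: T stochastic forward passes approximate the
# predictive distribution; entropy of the mean quantifies uncertainty.

import torch

@torch.no_grad()
def mc_dropout_predict(model: torch.nn.Module, x: torch.Tensor,
                       num_samples: int = 20):
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()  # re-enable only dropout, leaving BatchNorm etc. in eval
    probs = torch.stack([torch.softmax(model(x), dim=-1)
                         for _ in range(num_samples)])
    mean = probs.mean(dim=0)                            # predictive distribution
    entropy = -(mean * mean.clamp_min(1e-12).log()).sum(dim=-1)
    return mean, entropy                                # prediction + uncertainty
```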
