Over-Refusal Mitigation

Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks

Authors: Xin Yi, Yue Li, Linlin Wang, Xiaoling Wang, Liang He | Published: 2025-01-18
Prompt Injection
Adversarial Training
Over-Refusal Mitigation