Alignment

A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment

Authors: Kun Wang, Guibin Zhang, Zhenhong Zhou, Jiahao Wu, Miao Yu, Shiqian Zhao, Chenlong Yin, Jinhu Fu, Yibo Yan, Hanjun Luo, Liang Lin, Zhihao Xu, Haolang Lu, Xinye Cao, Xinyun Zhou, Weifei Jin, Fanci Meng, Junyuan Mao, Hao Wu, Minghe Wang, Fan Zhang, Junfeng Fang, Chengwei Liu, Yifan Zhang, Qiankun Li, Chongye Guo, Yalan Qin, Yi Ding, Donghai Hong, Jiaming Ji, Xinfeng Li, Yifan Jiang, Dongxia Wang, Yihao Huang, Yufei Guo, Jen-tse Huang, Yanwei Yue, Wenke Huang, Guancheng Wan, Tianlin Li, Lei Bai, Jie Zhang, Qing Guo, Jingyi Wang, Tianlong Chen, Joey Tianyi Zhou, Xiaojun Jia, Weisong Sun, Cong Wu, Jing Chen, Xuming Hu, Yiming Li, Xiao Wang, Ningyu Zhang, Luu Anh Tuan, Guowen Xu, Tianwei Zhang, Xingjun Ma, Xiang Wang, Bo An, Jun Sun, Mohit Bansal, Shirui Pan, Yuval Elovici, Bhavya Kailkhura, Bo Li, Yaodong Yang, Hongwei Li, Wenyuan Xu, Yizhou Sun, Wei Wang, Qing Li, Ke Tang, Yu-Gang Jiang, Felix Juefei-Xu, Hui Xiong, Xiaofeng Wang, Shuicheng Yan, Dacheng Tao, Philip S. Yu, Qingsong Wen, Yang Liu | Published: 2025-04-22
Alignment
Data Generation Safety
Prompt Injection

aiXamine: LLM Safety and Security Simplified

Authors: Fatih Deniz, Dorde Popovic, Yazan Boshmaf, Euisuh Jeong, Minhaj Ahmad, Sanjay Chawla, Issa Khalil | Published: 2025-04-21
LLM Performance Evaluation
Alignment
Performance Evaluation

GraphAttack: Exploiting Representational Blindspots in LLM Safety Mechanisms

Authors: Sinan He, An Wang | Published: 2025-04-17
Alignment
Prompt Injection
Vulnerability Research

Personalized Attacks of Social Engineering in Multi-turn Conversations — LLM Agents for Simulation and Detection

Authors: Tharindu Kumarage, Cameron Johnson, Jadie Adams, Lin Ai, Matthias Kirchner, Anthony Hoogs, Joshua Garland, Julia Hirschberg, Arslan Basharat, Huan Liu | Published: 2025-03-18
Alignment
Social Engineering Attacks
Attack Methods

SEA: Low-Resource Safety Alignment for Multimodal Large Language Models via Synthetic Embeddings

Authors: Weikai Lu, Hao Peng, Huiping Zhuang, Cen Chen, Ziqian Zeng | Published: 2025-02-18 | Updated: 2025-05-21
Alignment
Text Generation Methods
Prompt Injection

Generating Privacy-Preserving Personalized Advice with Zero-Knowledge Proofs and LLMs

Authors: Hiroki Watanabe, Motonobu Uchikoshi | Published: 2025-02-10 | Updated: 2025-04-24
Alignment
Privacy-Preserving Data Mining
Watermarking

SimPO: Simple Preference Optimization with a Reference-Free Reward

Authors: Yu Meng, Mengzhou Xia, Danqi Chen | Published: 2024-05-23 | Updated: 2024-11-01
Alignment
Selection and Evaluation of Optimization Algorithms
Deep Learning

KTO: Model Alignment as Prospect Theoretic Optimization

Authors: Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, Douwe Kiela | Published: 2024-02-02 | Updated: 2024-11-19
Alignment
Data Generation Methods
Deep Learning

Self-Rewarding Language Models

Authors: Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston | Published: 2024-01-18 | Updated: 2024-02-08
Alignment
Model Architecture
Deep Learning

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Authors: Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, Madian Khabsa | Published: 2023-12-07
Alignment
Data Generation Methods
Risk Analysis Methods