アライメント

LlamaFirewall: An open source guardrail system for building secure AI agents

Authors: Sahana Chennabasappa, Cyrus Nikolaidis, Daniel Song, David Molnar, Stephanie Ding, Shengye Wan, Spencer Whitman, Lauren Deason, Nicholas Doucette, Abraham Montilla, Alekhya Gampa, Beto de Paola, Dominik Gabi, James Crnkovich, Jean-Christophe Testud, Kat He, Rashnil Chaturvedi, Wu Zhou, Joshua Saxe | Published: 2025-05-06
LLMセキュリティ
アライメント
プロンプトインジェクション

Bridging Expertise Gaps: The Role of LLMs in Human-AI Collaboration for Cybersecurity

Authors: Shahroz Tariq, Ronal Singh, Mohan Baruwal Chhetri, Surya Nepal, Cecile Paris | Published: 2025-05-06
LLMとの協力効果
アライメント
参加者の質問分析

A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment

Authors: Kun Wang, Guibin Zhang, Zhenhong Zhou, Jiahao Wu, Miao Yu, Shiqian Zhao, Chenlong Yin, Jinhu Fu, Yibo Yan, Hanjun Luo, Liang Lin, Zhihao Xu, Haolang Lu, Xinye Cao, Xinyun Zhou, Weifei Jin, Fanci Meng, Junyuan Mao, Hao Wu, Minghe Wang, Fan Zhang, Junfeng Fang, Chengwei Liu, Yifan Zhang, Qiankun Li, Chongye Guo, Yalan Qin, Yi Ding, Donghai Hong, Jiaming Ji, Xinfeng Li, Yifan Jiang, Dongxia Wang, Yihao Huang, Yufei Guo, Jen-tse Huang, Yanwei Yue, Wenke Huang, Guancheng Wan, Tianlin Li, Lei Bai, Jie Zhang, Qing Guo, Jingyi Wang, Tianlong Chen, Joey Tianyi Zhou, Xiaojun Jia, Weisong Sun, Cong Wu, Jing Chen, Xuming Hu, Yiming Li, Xiao Wang, Ningyu Zhang, Luu Anh Tuan, Guowen Xu, Tianwei Zhang, Xingjun Ma, Xiang Wang, Bo An, Jun Sun, Mohit Bansal, Shirui Pan, Yuval Elovici, Bhavya Kailkhura, Bo Li, Yaodong Yang, Hongwei Li, Wenyuan Xu, Yizhou Sun, Wei Wang, Qing Li, Ke Tang, Yu-Gang Jiang, Felix Juefei-Xu, Hui Xiong, Xiaofeng Wang, Shuicheng Yan, Dacheng Tao, Philip S. Yu, Qingsong Wen, Yang Liu | Published: 2025-04-22
アライメント
データ生成の安全性
プロンプトインジェクション

aiXamine: Simplified LLM Safety and Security

Authors: Fatih Deniz, Dorde Popovic, Yazan Boshmaf, Euisuh Jeong, Minhaj Ahmad, Sanjay Chawla, Issa Khalil | Published: 2025-04-21 | Updated: 2025-04-23
LLM性能評価
アライメント
パフォーマンス評価

GraphAttack: Exploiting Representational Blindspots in LLM Safety Mechanisms

Authors: Sinan He, An Wang | Published: 2025-04-17
アライメント
プロンプトインジェクション
脆弱性研究

Personalized Attacks of Social Engineering in Multi-turn Conversations — LLM Agents for Simulation and Detection

Authors: Tharindu Kumarage, Cameron Johnson, Jadie Adams, Lin Ai, Matthias Kirchner, Anthony Hoogs, Joshua Garland, Julia Hirschberg, Arslan Basharat, Huan Liu | Published: 2025-03-18
アライメント
ソーシャルエンジニアリング攻撃
攻撃手法

Generating Privacy-Preserving Personalized Advice with Zero-Knowledge Proofs and LLMs

Authors: Hiroki Watanabe, Motonobu Uchikoshi | Published: 2025-02-10 | Updated: 2025-04-24
アライメント
プライバシー保護データマイニング
透かし

SimPO: Simple Preference Optimization with a Reference-Free Reward

Authors: Yu Meng, Mengzhou Xia, Danqi Chen | Published: 2024-05-23 | Updated: 2024-11-01
アライメント
最適化アルゴリズムの選択と評価
深層学習

KTO: Model Alignment as Prospect Theoretic Optimization

Authors: Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, Douwe Kiela | Published: 2024-02-02 | Updated: 2024-11-19
アライメント
データ生成手法
深層学習

Self-Rewarding Language Models

Authors: Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston | Published: 2024-01-18 | Updated: 2024-02-08
アライメント
モデルアーキテクチャ
深層学習