アライメント

Generating Privacy-Preserving Personalized Advice with Zero-Knowledge Proofs and LLMs

Authors: Hiroki Watanabe, Motonobu Uchikoshi | Published: 2025-02-10 | Updated: 2025-04-24
アライメント
プライバシー保護データマイニング
透かし

SimPO: Simple Preference Optimization with a Reference-Free Reward

Authors: Yu Meng, Mengzhou Xia, Danqi Chen | Published: 2024-05-23 | Updated: 2024-11-01
アライメント
最適化アルゴリズムの選択と評価
深層学習

KTO: Model Alignment as Prospect Theoretic Optimization

Authors: Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, Douwe Kiela | Published: 2024-02-02 | Updated: 2024-11-19
アライメント
データ生成手法
深層学習

Self-Rewarding Language Models

Authors: Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston | Published: 2024-01-18 | Updated: 2024-02-08
アライメント
モデルアーキテクチャ
深層学習

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Authors: Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, Madian Khabsa | Published: 2023-12-07
アライメント
データ生成手法
リスク分析手法

A General Theoretical Paradigm to Understand Learning from Human Preferences

Authors: Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos | Published: 2023-10-18 | Updated: 2023-11-22
アライメント
データ生成手法
深層学習

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Authors: Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn | Published: 2023-05-29 | Updated: 2024-07-29
アライメント
報酬メカニズム設計
強化学習最適化

RRHF: Rank Responses to Align Language Models with Human Feedback without tears

Authors: Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, Fei Huang | Published: 2023-04-11 | Updated: 2023-10-07
アライメント
報酬メカニズム設計
強化学習最適化

Constitutional AI: Harmlessness from AI Feedback

Authors: Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Jared Kaplan | Published: 2022-12-15
アライメント
プロンプトインジェクション
性能評価

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Authors: Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, Jared Kaplan | Published: 2022-04-12
アライメント
強化学習最適化
性能評価