aiXamine: Simplified LLM Safety and Security | Authors: Fatih Deniz, Dorde Popovic, Yazan Boshmaf, Euisuh Jeong, Minhaj Ahmad, Sanjay Chawla, Issa Khalil | Published: 2025-04-21 | Updated: 2025-04-23 | Tags: LLM Performance Evaluation, Alignment, Performance Evaluation | 2025.04.21, 2025.05.27 | Literature Database
GraphAttack: Exploiting Representational Blindspots in LLM Safety Mechanisms | Authors: Sinan He, An Wang | Published: 2025-04-17 | Tags: Alignment, Prompt Injection, Vulnerability Research | 2025.04.17, 2025.05.27 | Literature Database
Personalized Attacks of Social Engineering in Multi-turn Conversations -- LLM Agents for Simulation and Detection | Authors: Tharindu Kumarage, Cameron Johnson, Jadie Adams, Lin Ai, Matthias Kirchner, Anthony Hoogs, Joshua Garland, Julia Hirschberg, Arslan Basharat, Huan Liu | Published: 2025-03-18 | Tags: Alignment, Social Engineering Attack, Attack Method | 2025.03.18, 2025.05.27 | Literature Database
SEA: Low-Resource Safety Alignment for Multimodal Large Language Models via Synthetic Embeddings | Authors: Weikai Lu, Hao Peng, Huiping Zhuang, Cen Chen, Ziqian Zeng | Published: 2025-02-18 | Updated: 2025-05-21 | Tags: Alignment, Text Generation Method, Prompt Injection | 2025.02.18, 2025.05.28 | Literature Database
Generating Privacy-Preserving Personalized Advice with Zero-Knowledge Proofs and LLMs | Authors: Hiroki Watanabe, Motonobu Uchikoshi | Published: 2025-02-10 | Updated: 2025-04-24 | Tags: Alignment, Privacy-Preserving Data Mining, Watermark | 2025.02.10, 2025.05.27 | Literature Database
SimPO: Simple Preference Optimization with a Reference-Free Reward | Authors: Yu Meng, Mengzhou Xia, Danqi Chen | Published: 2024-05-23 | Updated: 2024-11-01 | Tags: Alignment, Selection and Evaluation of Optimization Algorithms, Deep Learning | 2024.05.23, 2025.05.27 | Literature Database
KTO: Model Alignment as Prospect Theoretic Optimization | Authors: Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, Douwe Kiela | Published: 2024-02-02 | Updated: 2024-11-19 | Tags: Alignment, Data Generation Method, Deep Learning | 2024.02.02, 2025.05.27 | Literature Database
Self-Rewarding Language Models | Authors: Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston | Published: 2024-01-18 | Updated: 2024-02-08 | Tags: Alignment, Model Architecture, Deep Learning | 2024.01.18, 2025.05.27 | Literature Database
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations | Authors: Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, Madian Khabsa | Published: 2023-12-07 | Tags: Alignment, Data Generation Method, Risk Analysis Method | 2023.12.07, 2025.05.28 | Literature Database
A General Theoretical Paradigm to Understand Learning from Human Preferences | Authors: Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos | Published: 2023-10-18 | Updated: 2023-11-22 | Tags: Alignment, Data Generation Method, Deep Learning | 2023.10.18, 2025.05.28 | Literature Database