Alignment

aiXamine: Simplified LLM Safety and Security

Authors: Fatih Deniz, Dorde Popovic, Yazan Boshmaf, Euisuh Jeong, Minhaj Ahmad, Sanjay Chawla, Issa Khalil | Published: 2025-04-21 | Updated: 2025-04-23
LLM Performance Evaluation
Alignment
Performance Evaluation

GraphAttack: Exploiting Representational Blindspots in LLM Safety Mechanisms

Authors: Sinan He, An Wang | Published: 2025-04-17
Alignment
Prompt Injection
Vulnerability Research

Personalized Attacks of Social Engineering in Multi-turn Conversations — LLM Agents for Simulation and Detection

Authors: Tharindu Kumarage, Cameron Johnson, Jadie Adams, Lin Ai, Matthias Kirchner, Anthony Hoogs, Joshua Garland, Julia Hirschberg, Arslan Basharat, Huan Liu | Published: 2025-03-18
Alignment
Social Engineering Attack
Attack Method

SEA: Low-Resource Safety Alignment for Multimodal Large Language Models via Synthetic Embeddings

Authors: Weikai Lu, Hao Peng, Huiping Zhuang, Cen Chen, Ziqian Zeng | Published: 2025-02-18 | Updated: 2025-05-21
Alignment
Text Generation Method
Prompt Injection

Generating Privacy-Preserving Personalized Advice with Zero-Knowledge Proofs and LLMs

Authors: Hiroki Watanabe, Motonobu Uchikoshi | Published: 2025-02-10 | Updated: 2025-04-24
Alignment
Privacy-Preserving Data Mining
Watermark

SimPO: Simple Preference Optimization with a Reference-Free Reward

Authors: Yu Meng, Mengzhou Xia, Danqi Chen | Published: 2024-05-23 | Updated: 2024-11-01
Alignment
Selection and Evaluation of Optimization Algorithms
Deep Learning

KTO: Model Alignment as Prospect Theoretic Optimization

Authors: Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, Douwe Kiela | Published: 2024-02-02 | Updated: 2024-11-19
Alignment
Data Generation Method
Deep Learning

Self-Rewarding Language Models

Authors: Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston | Published: 2024-01-18 | Updated: 2024-02-08
Alignment
Model Architecture
Deep Learning

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Authors: Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, Madian Khabsa | Published: 2023-12-07
Alignment
Data Generation Method
Risk Analysis Method

A General Theoretical Paradigm to Understand Learning from Human Preferences

Authors: Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos | Published: 2023-10-18 | Updated: 2023-11-22
Alignment
Data Generation Method
Deep Learning