Alignment

aiXamine: Simplified LLM Safety and Security

Authors: Fatih Deniz, Dorde Popovic, Yazan Boshmaf, Euisuh Jeong, Minhaj Ahmad, Sanjay Chawla, Issa Khalil | Published: 2025-04-21 | Updated: 2025-04-23

LLM Performance Evaluation

Alignment

Performance Evaluation

2025.04.21 2025.05.27

Literature Database

GraphAttack: Exploiting Representational Blindspots in LLM Safety Mechanisms

Authors: Sinan He, An Wang | Published: 2025-04-17

Alignment

Prompt Injection

Vulnerability Research

2025.04.17 2025.05.27

Personalized Attacks of Social Engineering in Multi-turn Conversations — LLM Agents for Simulation and Detection

Authors: Tharindu Kumarage, Cameron Johnson, Jadie Adams, Lin Ai, Matthias Kirchner, Anthony Hoogs, Joshua Garland, Julia Hirschberg, Arslan Basharat, Huan Liu | Published: 2025-03-18

Alignment

Social Engineering Attack

Attack Method

2025.03.18 2025.05.27

SEA: Low-Resource Safety Alignment for Multimodal Large Language Models via Synthetic Embeddings

Authors: Weikai Lu, Hao Peng, Huiping Zhuang, Cen Chen, Ziqian Zeng | Published: 2025-02-18 | Updated: 2025-05-21

Alignment

Text Generation Method

Prompt Injection

2025.02.18 2025.05.28

Generating Privacy-Preserving Personalized Advice with Zero-Knowledge Proofs and LLMs

Authors: Hiroki Watanabe, Motonobu Uchikoshi | Published: 2025-02-10 | Updated: 2025-04-24

Alignment

Privacy-Preserving Data Mining

Watermark

2025.02.10 2025.05.27

SimPO: Simple Preference Optimization with a Reference-Free Reward

Authors: Yu Meng, Mengzhou Xia, Danqi Chen | Published: 2024-05-23 | Updated: 2024-11-01

Alignment

Selection and Evaluation of Optimization Algorithms

Deep Learning

2024.05.23 2025.05.27

KTO: Model Alignment as Prospect Theoretic Optimization

Authors: Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, Douwe Kiela | Published: 2024-02-02 | Updated: 2024-11-19

Alignment

Data Generation Method

Deep Learning

2024.02.02 2025.05.27

Self-Rewarding Language Models

Authors: Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston | Published: 2024-01-18 | Updated: 2024-02-08

Alignment

Model Architecture

Deep Learning

2024.01.18 2025.05.27

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Authors: Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, Madian Khabsa | Published: 2023-12-07

Alignment

Data Generation Method

Risk Analysis Method

2023.12.07 2025.05.28

A General Theoretical Paradigm to Understand Learning from Human Preferences

Authors: Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos | Published: 2023-10-18 | Updated: 2023-11-22

Alignment

Data Generation Method

Deep Learning

2023.10.18 2025.05.28
