Alignment

From Defender to Devil? Unintended Risk Interactions Induced by LLM Defenses

Authors: Xiangtao Meng, Tianshuo Cong, Li Wang, Wenyu Chen, Zheng Li, Shanqing Guo, Xiaoyun Wang | Published: 2025-10-09

Alignment

Indirect Prompt Injection

Defense Effectiveness Analysis

2025.10.09 2025.10.11

Literature Database

Investigating Security Implications of Automatically Generated Code on the Software Supply Chain

Authors: Xiaofan Li, Xing Gao | Published: 2025-09-24

Alignment

Indirect Prompt Injection

Vulnerability Research

2025.09.24 2025.09.26

Literature Database

Evaluating Large Language Models for Phishing Detection, Self-Consistency, Faithfulness, and Explainability

Authors: Shova Kuikel, Aritran Piplai, Palvi Aggarwal | Published: 2025-06-16

Alignment

Prompt Injection

Large Language Model

2025.06.16 2025.06.18

Literature Database

QGuard: Question-based Zero-shot Guard for Multi-modal LLM Safety

Authors: Taegyeong Lee, Jeonghwa Yoo, Hyoungseo Cho, Soo Yong Kim, Yunho Maeng | Published: 2025-06-14 | Updated: 2025-09-30

Alignment

Ethical Statement

Malicious Prompt

2025.06.14 2025.10.02

Literature Database

The Scales of Justitia: A Comprehensive Survey on Safety Evaluation of LLMs

Authors: Songyang Liu, Chaozhuo Li, Jiameng Qiu, Xi Zhang, Feiran Huang, Litian Zhang, Yiming Hei, Philip S. Yu | Published: 2025-06-06 | Updated: 2025-10-30

Alignment

Large Language Model

Safety Evaluation

2025.06.06 2025.11.01

Literature Database

Client-Side Zero-Shot LLM Inference for Comprehensive In-Browser URL Analysis

Authors: Avihay Cohen | Published: 2025-06-04

Alignment

Prompt Injection

Dynamic Analysis

2025.06.04 2025.06.06

Literature Database

MCP Safety Training: Learning to Refuse Falsely Benign MCP Exploits using Improved Preference Alignment

Authors: John Halloran | Published: 2025-05-29

Poisoning attack on RAG

Alignment

Cooking Ingredients

2025.05.29 2025.05.31

Literature Database

Disrupting Vision-Language Model-Driven Navigation Services via Adversarial Object Fusion

Authors: Chunlong Xie, Jialing He, Shangwei Guo, Jiacheng Wang, Shudong Zhang, Tianwei Zhang, Tao Xiang | Published: 2025-05-29

Alignment

Adversarial Object Generation

Optimization Methods

2025.05.29 2025.05.31

Literature Database

Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization

Authors: Chengcan Wu, Zhixin Zhang, Zeming Wei, Yihao Zhang, Meng Sun | Published: 2025-05-22

LLM Security

Alignment

Adversarial Learning

2025.05.22 2025.05.28

Literature Database

CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning

Authors: Biao Yi, Tiansheng Huang, Baolei Zhang, Tong Li, Lihai Nie, Zheli Liu, Li Shen | Published: 2025-05-22

Alignment

Indirect Prompt Injection

Calculation of Output Harmfulness

2025.05.22 2025.05.28

Literature Database