AI Security Portal bot

“Short-length” Adversarial Training Helps LLMs Defend “Long-length” Jailbreak Attacks: Theoretical and Empirical Evidence

Authors: Shaopeng Fu, Liang Ding, Di Wang | Published: 2025-02-06
Prompt Injection
Large Language Model
Adversarial Training

Online Gradient Boosting Decision Tree: In-Place Updates for Efficient Adding/Deleting Data

Authors: Huawei Lin, Jun Woo Chung, Yingjie Lao, Weijie Zhao | Published: 2025-02-03
Online Learning

Adversarial Robustness in Two-Stage Learning-to-Defer: Algorithms and Guarantees

Authors: Yannis Montreuil, Axel Carlier, Lai Xing Ng, Wei Tsang Ooi | Published: 2025-02-03
Learning-to-Defer
Adversarial Example
Adversarial Training

AgentBreeder: Mitigating the AI Safety Impact of Multi-Agent Scaffolds via Self-Improvement

Authors: J Rosser, Jakob Nicolaus Foerster | Published: 2025-02-02 | Updated: 2025-04-14
LLM Performance Evaluation
Multi-Objective Optimization
Safety Alignment

Safety at Scale: A Comprehensive Survey of Large Model Safety

Authors: Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, Hanxun Huang, Yige Li, Jiaming Zhang, Xiang Zheng, Yang Bai, Zuxuan Wu, Xipeng Qiu, Jingfeng Zhang, Yiming Li, Xudong Han, Haonan Li, Jun Sun, Cong Wang, Jindong Gu, Baoyuan Wu, Siheng Chen, Tianwei Zhang, Yang Liu, Mingming Gong, Tongliang Liu, Shirui Pan, Cihang Xie, Tianyu Pang, Yinpeng Dong, Ruoxi Jia, Yang Zhang, Shiqing Ma, Xiangyu Zhang, Neil Gong, Chaowei Xiao, Sarah Erfani, Tim Baldwin, Bo Li, Masashi Sugiyama, Dacheng Tao, James Bailey, Yu-Gang Jiang | Published: 2025-02-02 | Updated: 2025-03-19
Indirect Prompt Injection
Prompt Injection
Attack Method

LLM Safety Alignment is Divergence Estimation in Disguise

Authors: Rajdeep Haldar, Ziyi Wang, Qifan Song, Guang Lin, Yue Xing | Published: 2025-02-02
Prompt Injection
Convergence Analysis
Large Language Model
Safety Alignment

Byzantine-Resilient Zero-Order Optimization for Communication-Efficient Heterogeneous Federated Learning

Authors: Maximilian Egger, Mayank Bakshi, Rawad Bitar | Published: 2025-01-31
Convergence Guarantee
Convergence Analysis
Communication Efficiency

BounTCHA: A CAPTCHA Utilizing Boundary Identification in Guided Generative AI-extended Videos

Authors: Lehao Lin, Ke Wang, Maha Abdallah, Wei Cai | Published: 2025-01-30 | Updated: 2025-04-01
CAPTCHA
Video Reliability Assurance
Vulnerability of Adversarial Examples

Smoothed Embeddings for Robust Language Models

Authors: Ryo Hase, Md Rafi Ur Rashid, Ashley Lewis, Jing Liu, Toshiaki Koike-Akino, Kieran Parsons, Ye Wang | Published: 2025-01-27
Prompt Injection
Membership Inference
Adversarial Training

Improving Network Threat Detection by Knowledge Graph, Large Language Model, and Imbalanced Learning

Authors: Lili Zhang, Quanyan Zhu, Herman Ray, Ying Xie | Published: 2025-01-26
Network Threat Detection
User Activity Analysis
Improvement of Learning