Disabling Safety Mechanisms of LLM

LLM Watermark Evasion via Bias Inversion

Authors: Jeongyeon Hwang, Sangdon Park, Jungseul Ok | Published: 2025-09-27 | Updated: 2025-10-01
Disabling Safety Mechanisms of LLM
Model Inversion
Statistical Testing

Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models

Authors: Miao Yu, Zhenhong Zhou, Moayad Aloqaily, Kun Wang, Biwei Huang, Stephen Wang, Yueming Jin, Qingsong Wen | Published: 2025-09-26 | Updated: 2025-09-30
Disabling Safety Mechanisms of LLM
Self-Attention Mechanism
Interpretability

RLCracker: Exposing the Vulnerability of LLM Watermarks with Adaptive RL Attacks

Authors: Hanbo Huang, Yiran Zhang, Hao Zheng, Xuan Gong, Yihan Li, Lin Liu, Shiyu Liang | Published: 2025-09-25
Disabling Safety Mechanisms of LLM
Prompt Injection
Watermark Design

bi-GRPO: Bidirectional Optimization for Jailbreak Backdoor Injection on LLMs

Authors: Wence Ji, Jiancan Wu, Aiying Li, Shuyi Zhang, Junkang Wu, An Zhang, Xiang Wang, Xiangnan He | Published: 2025-09-24
Disabling Safety Mechanisms of LLM
Prompt Injection
Generative Model

Send to which account? Evaluation of an LLM-based Scambaiting System

Authors: Hossein Siadati, Haadi Jafarian, Sima Jafarikhah | Published: 2025-09-10
Disabling Safety Mechanisms of LLM
Research Methodology
Fraud Countermeasures

Embedding Poisoning: Bypassing Safety Alignment via Embedding Semantic Shift

Authors: Shuai Yuan, Zhibo Zhang, Yuxi Li, Guangdong Bai, Wang Kailong | Published: 2025-09-08
Disabling Safety Mechanisms of LLM
Calculation of Output Harmfulness
Attack Detection Method

EverTracer: Hunting Stolen Large Language Models via Stealthy and Robust Probabilistic Fingerprint

Authors: Zhenhua Xu, Meng Han, Wenpeng Xing | Published: 2025-09-03
Disabling Safety Mechanisms of LLM
Data Protection Method
Prompt Validation

Consiglieres in the Shadow: Understanding the Use of Uncensored Large Language Models in Cybercrimes

Authors: Zilong Lin, Zichuan Li, Xiaojing Liao, XiaoFeng Wang | Published: 2025-08-18
Disabling Safety Mechanisms of LLM
Data Generation Method
Calculation of Output Harmfulness

PRISON: Unmasking the Criminal Potential of Large Language Models

Authors: Xinyi Wu, Geng Hong, Pei Chen, Yueyue Chen, Xudong Pan, Min Yang | Published: 2025-06-19 | Updated: 2025-08-04
Disabling Safety Mechanisms of LLM
Law Enforcement Evasion
Research Methodology

LLMs Cannot Reliably Judge (Yet?): A Comprehensive Assessment on the Robustness of LLM-as-a-Judge

Authors: Songze Li, Chuokun Xu, Jiaying Wang, Xueluan Gong, Chen Chen, Jirui Zhang, Jun Wang, Kwok-Yan Lam, Shouling Ji | Published: 2025-06-11
Disabling Safety Mechanisms of LLM
Prompt Injection
Adversarial attack