These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
While reinforcement learning-trained Large Reasoning Models (LRMs, e.g.,
Deepseek-R1) demonstrate advanced reasoning capabilities in the evolving Large
Language Models (LLMs) domain, their susceptibility to security threats remains
a critical vulnerability. This weakness is particularly evident in
Chain-of-Thought (CoT) generation processes, where adversarial methods like
backdoor prompt attacks can systematically subvert the model's core reasoning
mechanisms. The emerging Chain-of-Thought Attack (CoTA) reveals this
vulnerability through exploiting prompt controllability, simultaneously
degrading both CoT safety and task performance with low-cost interventions. To
address this compounded security-performance vulnerability, we propose Thought
Purity (TP): a defense paradigm that systematically strengthens resistance to
malicious content while preserving operational efficacy. Our solution achieves
this through three synergistic components: (1) a safety-optimized data
processing pipeline (2) reinforcement learning-enhanced rule constraints (3)
adaptive monitoring metrics. Our approach establishes the first comprehensive
defense mechanism against CoTA vulnerabilities in reinforcement
learning-aligned reasoning systems, significantly advancing the
security-functionality equilibrium for next-generation AI architectures.