Abstract
Chain-of-Thought (CoT) reasoning has emerged as a powerful technique for enhancing large language models’ capabilities by generating intermediate reasoning steps for complex tasks. A common practice for equipping LLMs with reasoning is to fine-tune pre-trained models on CoT datasets from public repositories such as HuggingFace, which creates new attack vectors targeting the reasoning traces themselves. While prior work has shown that backdoor attacks can be mounted on CoT-based models, these attacks require the explicit inclusion of triggered queries with flawed reasoning and incorrect answers in the training set to succeed. Our work unveils a new class of Indirect Targeted Poisoning attacks on reasoning models that manipulate responses to a target task by transferring CoT traces learned from a different task. Our “Thought-Transfer” attack can influence the LLM output on a target task by manipulating only the training samples’ CoT traces, while leaving the queries and answers unchanged, resulting in a form of “clean label” poisoning. Unlike prior targeted poisoning attacks that explicitly require target task samples in the poisoned data, we demonstrate that thought-transfer achieves 70
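
To make the threat model concrete, the following minimal sketch illustrates the kind of clean-label manipulation the abstract describes: for a small fraction of fine-tuning samples, only the CoT trace is replaced with a trace crafted on a different source task, while the question and answer fields are left untouched. The field names ("question", "cot", "answer"), the poison_cot_traces helper, and the random selection scheme are illustrative assumptions for this example, not the paper's actual pipeline.

    # Illustrative sketch (assumed data layout), not the authors' code.
    import random
    from typing import Dict, List

    def poison_cot_traces(
        dataset: List[Dict[str, str]],
        transferred_traces: List[str],
        poison_rate: float = 0.05,
        seed: int = 0,
    ) -> List[Dict[str, str]]:
        """Replace the CoT trace of a fraction of samples with traces taken
        from a different source task; queries and answers stay unchanged,
        so the poisoned samples remain 'clean label'."""
        rng = random.Random(seed)
        poisoned = []
        for sample in dataset:
            record = dict(sample)  # copy; "question" and "answer" kept as-is
            if transferred_traces and rng.random() < poison_rate:
                # Only the intermediate reasoning is swapped out.
                record["cot"] = rng.choice(transferred_traces)
            poisoned.append(record)
        return poisoned

    if __name__ == "__main__":
        clean = [
            {"question": "What is 2 + 3?", "cot": "2 plus 3 equals 5.", "answer": "5"},
            {"question": "What is 4 * 6?", "cot": "4 times 6 equals 24.", "answer": "24"},
        ]
        crafted = ["Hypothetical reasoning trace transferred from a source task."]
        print(poison_cot_traces(clean, crafted, poison_rate=1.0))

Because the answers are never altered, a dataset audit that checks only query-answer pairs would not flag such samples; any effect on the target task would come solely from the substituted reasoning traces.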
