Machine Unlearning Meets Adversarial Robustness via Constrained Interventions on LLMs

arXiv

AI Security Portal bot

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2510.03567

PDF

https://arxiv.org/pdf/2510.03567

Bibliographic Information

Authors: Fatmazohra Rezkellah, Ramzi Dakhmouche
Published: 2025-10-04
Updated: 2025-10-17
Affiliation: Department of Computer Science, Université Paris-Dauphine
Country of affiliation: France
Venue: Computing Research Repository (CoRR)

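AI-Estimated Labels

Identification of AI-generated output

Large language models

Robustness

Note: These labels were added automatically by AI and may therefore be inaccurate.
For details, see "About the Literature Database".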

Abstract

With the increasing adoption of Large Language Models (LLMs), more customization is needed to ensure privacy-preserving and safe generation. We address this objective from two critical aspects: unlearning of sensitive information and robustness to jail-breaking attacks. We investigate various constrained optimization formulations that address both aspects in a unified manner, by finding the smallest possible interventions on LLM weights that either make a given vocabulary set unreachable or embed the LLM with robustness to tailored attacks by shifting part of the weights to a safer region. Beyond unifying two key properties, this approach contrasts with previous work in that it doesn't require an oracle classifier that is typically not available or represents a computational overhead. Surprisingly, we find that the simplest point-wise constraint-based intervention we propose leads to better performance than max-min interventions, while having a lower computational cost. Comparison against state-of-the-art defense methods demonstrates superior performance of the proposed approach.

External Datasets

custom obedience dataset

HarmBench