Machine Unlearning in Large Language Models

arXiv

AI Security Portal bot

The information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2404.16841

PDF

https://arxiv.org/pdf/2404.16841

Bibliographic Information

Authors: Kongyang Chen; Zixin Wang; Bing Mi; Waixi Liu; Shaowei Wang; Xiaojun Ren; Jiaxing Shen
Published: 2024-2-3
Affiliation: Institute of Artificial Intelligence, Guangzhou University
Country: China
Venue: Computing Research Repository (CoRR)

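Labels Estimated by AI

Privacy protection methods, Ethical guideline compliance, Model performance evaluation

Note: These labels were added automatically by AI and may therefore be inaccurate.
For details, see "About the Literature Database".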

Abstract

Recently, large language models (LLMs) have emerged as a notable field, attracting significant attention for their ability to automatically generate intelligent content for various application domains. However, LLMs still suffer from significant security and privacy issues. For example, LLMs might expose private user data in response to hacking attacks or targeted prompts. To address this problem, this paper introduces a novel machine unlearning framework for LLMs. Our objectives are to prevent LLMs from producing harmful, hallucinatory, or privacy-compromising responses while retaining their standard output capabilities. To accomplish this, we use an evaluative model to pinpoint dialogues needing unlearning. We also establish a distance loss to function as the model's negative loss, diverting it from previous undesirable outputs. Furthermore, we determine the expected output's cluster mean to formulate a positive loss, directing the model's outputs toward preferable outcomes without compromising its reasoning abilities and performance. Experimental results show that our approach effectively meets unlearning objectives without substantially compromising model performance.
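
The abstract names two loss terms but gives no formulas, so the following is only an illustrative sketch of how a negative distance loss and a cluster-mean positive loss could be combined. Everything specific here is an assumption and not taken from the paper: the use of squared Euclidean distance, the representation of outputs as small embedding vectors, the weights `alpha` and `beta`, and the function names `squared_distance`, `cluster_mean`, and `unlearning_loss`.

```python
from math import fsum


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return fsum((x - y) ** 2 for x, y in zip(a, b))


def cluster_mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [fsum(column) / n for column in zip(*vectors)]


def unlearning_loss(output, undesired, preferred, alpha=1.0, beta=1.0):
    """Hypothetical combined objective; lower values are better.

    Negative term: -alpha * d(output, undesired), which decreases as the
    output moves away from the undesirable response.
    Positive term: beta * d(output, mean(preferred)), which decreases as
    the output moves toward the cluster mean of preferred responses.
    """
    negative = -alpha * squared_distance(output, undesired)
    positive = beta * squared_distance(output, cluster_mean(preferred))
    return negative + positive


# Toy 2-D "embeddings" (made-up values, for illustration only).
undesired = [1.0, 0.0]
preferred = [[0.0, 1.0], [0.0, 3.0]]  # cluster mean is [0.0, 2.0]

near_bad = [1.0, 0.0]    # identical to the undesirable output
near_good = [0.0, 2.0]   # identical to the preferred cluster mean

print(unlearning_loss(near_bad, undesired, preferred))   # 5.0
print(unlearning_loss(near_good, undesired, preferred))  # -5.0
```

In this toy setup the output that coincides with the undesirable response scores higher (worse) than the one at the preferred cluster mean. Note that the negative term in this sketch is unbounded below, so a practical version would likely need clipping or a margin; how the paper handles this is not stated in the abstract.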

External Datasets

PKU-SafeRLHF

HaluEval

TruthfulQA

BookCorpus