Abstract
Recently, large language models (LLMs) have emerged as a notable field,
attracting significant attention for their ability to automatically generate
intelligent content for various application domains. However, LLMs still
suffer from significant security and privacy issues. For example, LLMs might
leak private user data under hacking attacks or targeted prompts. To address this
problem, this paper introduces a novel machine unlearning framework for LLMs.
Our objectives are to make LLMs not produce harmful, hallucinatory, or
privacy-compromising responses, while retaining their standard output
capabilities. To accomplish this, we use an evaluative model to pinpoint
dialogues needing unlearning. We also establish a distance loss to function as
the model's negative loss, diverting it from previous undesirable outputs.
Furthermore, we determine the expected output's cluster mean to formulate a
positive loss, directing the model's outputs toward preferable outcomes without
compromising its reasoning abilities and performance. Experimental results show
that our approach effectively meets unlearning objectives without substantially
compromising model performance.
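
The two loss terms described above can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the distance metric, the representation space, or how the terms are weighted, so the Euclidean distance, the plain-vector embeddings, and the names `euclidean`, `cluster_mean`, `unlearning_loss`, and `neg_weight` are all assumptions made for illustration.

```python
import math

def euclidean(a, b):
    # Assumed distance metric; the abstract does not name one.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_mean(vectors):
    # Component-wise mean of the expected (preferred) output vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def unlearning_loss(output, undesirable, preferred, neg_weight=1.0):
    # Negative loss: the negated distance to the previous undesirable output,
    # so minimizing it pushes the output away from that output.
    negative = -euclidean(output, undesirable)
    # Positive loss: distance to the cluster mean of the expected outputs,
    # so minimizing it pulls the output toward preferable outcomes.
    positive = euclidean(output, cluster_mean(preferred))
    # neg_weight is a hypothetical knob for balancing the two terms.
    return positive + neg_weight * negative

# Toy 2-D vectors standing in for output representations (hypothetical data).
preferred = [[1.0, 0.0], [3.0, 0.0]]
undesirable = [0.0, 4.0]
near_bad = unlearning_loss([0.0, 3.0], undesirable, preferred)
near_good = unlearning_loss([2.0, 0.0], undesirable, preferred)
```

In this toy setting, an output that sits on the preferred cluster mean and far from the undesirable output (`near_good`) receives a lower loss than one that sits close to the undesirable output (`near_bad`). In practice the negative term would likely need bounding or weighting so that it does not dominate training.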
External Datasets
PKU-SafeRLHF
HaluEval
TruthfulQA
BookCorpus