Abstract
Model extraction attacks (MEAs) on large language models (LLMs) have received
increasing attention in recent research. However, existing attack methods
typically adapt extraction strategies originally developed for deep neural
networks (DNNs), neglecting the underlying inconsistency between the training
task of MEA and that of LLM alignment, which leads to suboptimal attack performance. To
tackle this issue, we propose Locality Reinforced Distillation (LoRD), a novel
model extraction algorithm specifically designed for LLMs. In particular, LoRD
employs a newly defined policy-gradient-style training task that uses the
victim model's responses as a signal to guide the crafting of preferences for
the local model. Theoretical analyses demonstrate that I) the convergence
procedure of LoRD during model extraction is consistent with the alignment
procedure of LLMs, and II) LoRD can reduce query complexity while mitigating
watermark protection through our exploration-based stealing. Extensive
experiments validate the superiority of our method in extracting various
state-of-the-art commercial LLMs. Our code is available at:
https://github.com/liangzid/LoRD-MEA.
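To make the abstract's core idea concrete, the sketch below illustrates, in a deliberately toy setting, what a policy-gradient-style update with a preference signal looks like: the victim model's response is treated as the preferred output, and the local model's own sample as the dispreferred one. This is a hypothetical illustration of the general technique only, not the LoRD algorithm itself; the toy vocabulary, the `preference_step` function, and all numeric choices are assumptions for exposition.

```python
import numpy as np

# Toy "local model": a single logit vector over a tiny vocabulary.
# (Hypothetical setup; a real local model would be a full LLM.)
rng = np.random.default_rng(0)
VOCAB = 8
theta = rng.normal(size=VOCAB)

def log_softmax(logits):
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())

def seq_logprob(tokens, logits):
    """Sum of per-token log-probabilities under the toy local model."""
    lp = log_softmax(logits)
    return sum(lp[t] for t in tokens)

def preference_step(theta, victim_resp, local_resp, lr=0.1):
    """One policy-gradient-style update: raise the log-probability of the
    victim's (preferred) response and lower that of the local model's own
    (dispreferred) sample. The victim response acts as the preference signal."""
    probs = np.exp(log_softmax(theta))
    grad = np.zeros_like(theta)
    for t in victim_resp:            # push toward victim tokens
        g = -probs.copy(); g[t] += 1.0
        grad += g
    for t in local_resp:             # push away from dispreferred tokens
        g = -probs.copy(); g[t] += 1.0
        grad -= g
    return theta + lr * grad

victim = [1, 3, 3]   # tokens returned by the queried victim (hypothetical)
local = [5, 5, 2]    # tokens sampled by the local model (hypothetical)

before = seq_logprob(victim, theta)
for _ in range(10):
    theta = preference_step(theta, victim, local)
after = seq_logprob(victim, theta)
```

After a few updates, the victim's response becomes more likely under the local model than the local model's original sample, which is the qualitative behavior a preference-guided extraction objective aims for.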