Abstract
Model extraction attacks (MEAs) on large language models (LLMs) have received
increasing attention in recent research. However, existing attack methods
typically adapt extraction strategies originally developed for deep neural
networks (DNNs), neglecting the underlying inconsistency between the training
task of MEA and that of LLM alignment, which leads to suboptimal attack performance. To
tackle this issue, we propose Locality Reinforced Distillation (LoRD), a novel
model extraction algorithm specifically designed for LLMs. In particular, LoRD
employs a newly defined policy-gradient-style training task that uses the
victim model's responses as a signal to guide the crafting of preferences for
the local model. Theoretical analyses demonstrate that I) the convergence
procedure of LoRD during model extraction is consistent with the alignment
procedure of LLMs, and II) LoRD can reduce query complexity while mitigating
watermark protection through our exploration-based stealing. Extensive
experiments validate the superiority of our method in extracting various
state-of-the-art commercial LLMs. Our code is available at:
https://github.com/liangzid/LoRD-MEA.
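To make the abstract's core idea concrete, the sketch below illustrates, in a deliberately toy setting, what a policy-gradient-style update with a preference signal looks like: the victim model's response is treated as the preferred output, and the local model's own sample as the dispreferred one. This is a hypothetical illustration of the general technique only, not the LoRD algorithm itself; the toy vocabulary, the `preference_step` function, and all numeric choices are assumptions for exposition.

```python
import numpy as np

# Toy "local model": a single logit vector over a tiny vocabulary.
# (Hypothetical setup; a real local model would be a full LLM.)
rng = np.random.default_rng(0)
VOCAB = 8
theta = rng.normal(size=VOCAB)

def log_softmax(logits):
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())

def seq_logprob(tokens, logits):
    """Sum of per-token log-probabilities under the toy local model."""
    lp = log_softmax(logits)
    return sum(lp[t] for t in tokens)

def preference_step(theta, victim_resp, local_resp, lr=0.1):
    """One policy-gradient-style update: raise the log-probability of the
    victim's (preferred) response and lower that of the local model's own
    (dispreferred) sample. The victim response acts as the preference signal."""
    probs = np.exp(log_softmax(theta))
    grad = np.zeros_like(theta)
    for t in victim_resp:            # push toward victim tokens
        g = -probs.copy(); g[t] += 1.0
        grad += g
    for t in local_resp:             # push away from dispreferred tokens
        g = -probs.copy(); g[t] += 1.0
        grad -= g
    return theta + lr * grad

victim = [1, 3, 3]   # tokens returned by the queried victim (hypothetical)
local = [5, 5, 2]    # tokens sampled by the local model (hypothetical)

before = seq_logprob(victim, theta)
for _ in range(10):
    theta = preference_step(theta, victim, local)
after = seq_logprob(victim, theta)
```

After a few updates, the victim's response becomes more likely under the local model than the local model's original sample, which is the qualitative behavior a preference-guided extraction objective aims for.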