Adversarial Contrastive Learning for LLM Quantization Attacks

Authors: Dinghong Song, Zhiwei Xu, Hai Wan, Xibin Zhao, Pengfei Su, Dong Li | Published: 2026-01-06

2026.01.06

Authors: Dinghong Song, Zhiwei Xu, Hai Wan, Xibin Zhao, Pengfei Su, Dong Li
Published: 2026-01-06

Source: https://arxiv.org/abs/2601.02680

PDF: https://arxiv.org/pdf/2601.02680

AIにより推定されたラベル

量子化とプライバシー LLMの安全機構の解除モデル抽出攻撃

※ こちらのラベルはAIによって自動的に追加されました。そのため、正確でないことがあります。
詳細は文献データベースについてをご覧ください。

Abstract

Model quantization is critical for deploying large language models (LLMs) on resource-constrained hardware, yet recent work has revealed severe security risks that benign LLMs in full precision may exhibit malicious behaviors after quantization. In this paper, we propose Adversarial Contrastive Learning (ACL), a novel gradient-based quantization attack that achieves superior attack effectiveness by explicitly maximizing the gap between benign and harmful responses probabilities. ACL formulates the attack objective as a triplet-based contrastive loss, and integrates it with a projected gradient descent two-stage distributed fine-tuning strategy to ensure stable and efficient optimization. Extensive experiments demonstrate ACL’s remarkable effectiveness, achieving attack success rates of 86.00