Abstract
With the growing use of large language models (LLMs) hosted on cloud
platforms to offer inference services, privacy concerns about the potential
leakage of sensitive information are escalating. Secure multi-party computation
(MPC) is a promising solution for protecting privacy in LLM inference. However,
MPC requires frequent inter-server communication, causing high performance
overhead.
Inspired by the prevalent activation sparsity of LLMs, where most neurons are
not activated after non-linear activation functions, we propose an efficient
private inference system, Comet. This system employs an accurate and fast
predictor to predict the sparsity distribution of the activation function outputs.
Additionally, we introduce a new private inference protocol. It efficiently and
securely avoids computations involving zero values by exploiting the spatial
locality of the predicted sparsity distribution. While this computation-avoidance
approach impacts the spatiotemporal continuity of KV cache entries, we address
this challenge with a low-communication-overhead cache refilling strategy that
merges miss requests and incorporates a prefetching mechanism. Finally, we
evaluate Comet on four common LLMs and compare it with six state-of-the-art
private inference systems. Comet achieves a 1.87x-2.63x speedup and a
1.94x-2.64x communication reduction.
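
As an illustration of the activation-sparsity idea only, the following plaintext Python sketch shows how a feed-forward block can skip neurons that are zero after ReLU without changing its output. It is not Comet's protocol: there is no secret sharing or MPC here, the function names (`dense_ffn`, `sparse_ffn`) are hypothetical, and the set of active neurons is taken from an exact "oracle" pass rather than from a learned predictor.

```python
import random


def relu(x):
    return x if x > 0.0 else 0.0


def matvec(matrix, vec):
    # Plain dense matrix-vector product over Python lists.
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def dense_ffn(x, w_up, w_down):
    # w_up: hidden_dim x in_dim, w_down: out_dim x hidden_dim.
    hidden = [relu(h) for h in matvec(w_up, x)]
    return matvec(w_down, hidden), hidden


def sparse_ffn(x, w_up, w_down, active):
    # Evaluate only the neurons listed in `active`; every other neuron is
    # treated as zero, so its row of w_up and column of w_down are never read.
    hidden = {j: relu(sum(w * v for w, v in zip(w_up[j], x))) for j in active}
    return [sum(row[j] * h for j, h in hidden.items()) for row in w_down]


if __name__ == "__main__":
    rng = random.Random(0)
    in_dim, hidden_dim = 8, 32
    x = [rng.gauss(0, 1) for _ in range(in_dim)]
    w_up = [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w_down = [[rng.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(in_dim)]

    dense_out, hidden = dense_ffn(x, w_up, w_down)
    # Assumption: an oracle mask stands in for the sparsity predictor.
    active = [j for j, h in enumerate(hidden) if h > 0.0]
    sparse_out = sparse_ffn(x, w_up, w_down, active)

    print(f"active neurons: {len(active)} of {hidden_dim}")
    print("max abs difference:",
          max(abs(a - b) for a, b in zip(dense_out, sparse_out)))
```

With random Gaussian weights only about half of the neurons are inactive; the abstract's premise is that trained LLMs are far sparser, which is what makes skipping worthwhile when each skipped multiplication would otherwise cost inter-server communication.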