These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
CPU-based trusted execution environments (TEEs) and differential privacy (DP)
have gained wide applications for private inference. Due to high inference
latency in TEEs, researchers use partition-based approaches that offload linear
model components to GPUs. However, dense nonlinear layers of large language
models (LLMs) result in significant communication overhead between TEEs and
GPUs. DP-based approaches apply random noise to protect data privacy, but this
compromises LLM performance and semantic understanding. To overcome the above
drawbacks, this paper proposes CMIF, a Confidential and efficient Model
Inference Framework. CMIF confidentially deploys the embedding layer in the
client-side TEE and subsequent layers on GPU servers. Meanwhile, it optimizes
the Report-Noisy-Max mechanism to protect sensitive inputs with a slight
decrease in model performance. Extensive experiments on Llama-series models
demonstrate that CMIF reduces additional inference overhead in TEEs while
preserving user data privacy.