The widespread deployment of Large Language Models (LLMs) has given rise to strong
demand for optimizing their inference performance. Today's techniques serving
this purpose primarily focus on reducing latency and improving throughput
through algorithmic and hardware enhancements, while largely overlooking their
privacy side effects, particularly in a multi-user environment. In this work,
we identify, for the first time, a set of timing side channels in LLM systems,
arising from shared caches and GPU memory allocations, which can be exploited
to infer both confidential system prompts and prompts issued by other users.
These vulnerabilities echo security challenges observed in
traditional computing systems, highlighting an urgent need to address potential
information leakage in LLM serving infrastructures. In this paper, we report
novel attack strategies designed to exploit such timing side channels inherent
in LLM deployments, specifically targeting the Key-Value (KV) cache and the
semantic cache, both widely used to enhance LLM inference performance. Our approach
leverages timing measurements and classification models to detect cache hits,
allowing an adversary to infer private prompts with high accuracy. We also
propose a token-by-token search algorithm to efficiently recover shared prompt
prefixes in the caches, showing the feasibility of stealing system prompts and
prompts produced by peer users. Our black-box experiments on popular online LLM
services demonstrate that these privacy risks are realistic and carry
significant consequences. Our findings underscore the need for
robust mitigation to protect LLM systems against such emerging threats.
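
To make the idea of a token-by-token prefix search concrete, the following is a
minimal sketch run against a toy simulation, not the attack implementation
evaluated in the paper. Everything named here is an assumption for
illustration: the SimulatedServer class, its latency constants, the fixed
22.5 ms threshold (a stand-in for the classification models mentioned above),
and the tiny vocabulary. The simulation also ignores effects a real service
would exhibit, such as probes themselves populating the cache, coarser cache
granularity, and network jitter.

```python
import random


class SimulatedServer:
    """Hypothetical stand-in for an LLM service with a shared prefix cache."""

    def __init__(self, cached_prompt, seed=0):
        self.cached = list(cached_prompt)
        self.rng = random.Random(seed)

    def latency_ms(self, tokens):
        # Count how many leading tokens match the cached prompt.
        hit = 0
        for sent, cached in zip(tokens, self.cached):
            if sent != cached:
                break
            hit += 1
        miss = len(tokens) - hit
        # Assumed cost model: fixed overhead, 5 ms per uncached token,
        # plus bounded noise of at most 1 ms in either direction.
        return 20.0 + 5.0 * miss + self.rng.uniform(-1.0, 1.0)


def is_cache_hit(server, tokens, threshold_ms=22.5, trials=3):
    """Classify a request as a full cache hit using the median of a few timings."""
    samples = sorted(server.latency_ms(tokens) for _ in range(trials))
    return samples[trials // 2] < threshold_ms


def recover_prefix(server, vocab, max_len=32):
    """Extend the recovered prefix one token at a time while some candidate hits."""
    recovered = []
    while len(recovered) < max_len:
        for tok in vocab:
            if is_cache_hit(server, recovered + [tok]):
                recovered.append(tok)
                break
        else:
            # No candidate produced a hit, so the cached prefix has ended.
            return recovered
    return recovered


secret = ["you", "are", "a", "helpful", "assistant"]
vocab = ["the", "a", "you", "is", "are", "assistant", "helpful", "model"]
server = SimulatedServer(secret)
print(recover_prefix(server, vocab))
```

In this toy setting the search needs at most len(vocab) probes per position,
so the cost grows linearly with the prefix length instead of exponentially, as
it would if whole prompts had to be guessed at once.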