https://www.glean.com/blog/glean-kv-caches-llm-latency

Welcome back to our series on LLM latency! In our previous blog, we took a close look at how input token count impacts the latency of LLM chat tools. In this blog, we’ll explore how KV (key-value) caching impacts time to first token (TTFT) latency and throughput for LLM calls.

The impact of KV caches on TTFT

LLMs are autoregressive models, where the generation of the ith token depends on all the tokens that came before it. This means that computing the attention scores for the ith token involves all the same operations as were done for the (i-1)th token, plus the additional computation for this latest token. The keys and values computed for earlier tokens are therefore a great opportunity to cache.

Caching these values has the following implications:

  1. The initiation phase, which, as we learned in the previous blog, refers to generating the first token, is unaffected by the KV caching strategy since there are no previous steps. This phase now populates the KV cache for the subsequent decoding steps.
  2. For the decoding phase, we no longer feed the whole sequence as input, only the last generated token plus the KV cache.
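To make these two phases concrete, here is a minimal single-head attention sketch in NumPy that keeps a KV cache: prefill processes the whole prompt once and populates the cache, and decode_step attends with only the newest token's query against the cached keys and values. This is a toy illustration with made-up names, not any particular framework's API, and it omits causal masking, batching, and multiple heads for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class SingleHeadAttentionWithCache:
    """Toy single-head attention that keeps a KV cache across generation steps."""

    def __init__(self, d_model, seed=0):
        rng = np.random.default_rng(seed)
        # Random projection weights stand in for trained parameters.
        self.Wq = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.Wk = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.Wv = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.k_cache = None  # (tokens_so_far, d_model)
        self.v_cache = None

    def prefill(self, x):
        """Initiation phase: process the whole prompt and populate the cache."""
        q, k, v = x @ self.Wq, x @ self.Wk, x @ self.Wv
        self.k_cache, self.v_cache = k, v
        scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (w, w): quadratic in w
        return scores @ v

    def decode_step(self, x_new):
        """Decoding phase: only the newest token's query, plus the cached K/V."""
        q = x_new @ self.Wq                                # (1, d_model)
        self.k_cache = np.vstack([self.k_cache, x_new @ self.Wk])
        self.v_cache = np.vstack([self.v_cache, x_new @ self.Wv])
        scores = softmax(q @ self.k_cache.T / np.sqrt(q.shape[-1]))  # (1, w): linear in w
        return scores @ self.v_cache

# Usage: one quadratic prefill, then cheap per-token decode steps.
d_model, prompt_len = 16, 8
attn = SingleHeadAttentionWithCache(d_model)
prompt = np.random.default_rng(1).standard_normal((prompt_len, d_model))
attn.prefill(prompt)
attn.decode_step(np.zeros((1, d_model)))
```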

So, how does attention computation scale now? As we discussed in the last blog, computing attention scores within a Transformer comes down to matrix multiplications, and multiplying a matrix of shape (n, p) with another matrix of shape (p, m) involves approximately 2mnp operations. In the attention layer during the initiation phase, m and n are both equal to the size of the context window, w, so the cost of this operation becomes 2pw^2, a quadratic relation. With the KV cache in place, however, the query for each subsequent generation step is a single token (n = 1) while the cached keys still span the window (m = w), so the per-step cost drops to roughly 2pw, which is linear in w.
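As a quick back-of-the-envelope check on that scaling, the snippet below plugs illustrative numbers into the 2mnp rule; the window size and projection dimension are made up for the example.

```python
def matmul_flops(n, p, m):
    # Multiplying an (n, p) matrix by a (p, m) matrix costs roughly 2*n*p*m operations.
    return 2 * n * p * m

w, p = 4096, 128  # illustrative context window and projection dimension

# Initiation/prefill: queries and keys both span the full window -> quadratic in w.
prefill_ops = matmul_flops(w, p, w)   # 2 * p * w**2
# Cached decode: a single-token query against w cached keys -> linear in w.
decode_ops = matmul_flops(1, p, w)    # 2 * p * w

print(f"prefill: {prefill_ops:,} ops, cached decode step: {decode_ops:,} ops")
print(f"ratio: {prefill_ops // decode_ops}x")  # equals w
```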

This explains how completions within a single LLM call benefit from caching, but if the cache is persisted, it can also be reused across LLM calls. Two ways to take advantage of this are:

  1. Caching in multi-turn use cases within a single conversation

Within a single conversation, the chat history is an ever-growing string prefix that can be cached across turns!

  2. Caching across workflows that have similar prefixes

Being able to reuse the KV cache across requests should enable TTFT latency wins as well as throughput improvements, since concurrent requests that share a prefix can also share a single copy of the cached keys and values, reducing the memory footprint (see the sketch below).
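One simplified way to picture cross-request reuse is an in-memory store keyed by token prefix: look up the longest cached prefix of the incoming prompt, prefill only the uncached suffix, and write the extended cache back. The PrefixKVCacheStore class and the string-based prefill stand-in below are hypothetical illustrations, not a real serving engine's API; production systems do this with more sophisticated block-level bookkeeping.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical KV cache entry: in a real engine this would hold per-layer
# key/value tensors; a string placeholder keeps the sketch self-contained.
KVCache = str

class PrefixKVCacheStore:
    """In-memory store mapping a token prefix to its KV cache."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[int, ...], KVCache] = {}

    def longest_prefix(self, tokens: List[int]) -> Tuple[int, Optional[KVCache]]:
        """Return (matched_length, cached_kv) for the longest cached prefix of tokens."""
        best_len, best_kv = 0, None
        for prefix, kv in self._store.items():
            n = len(prefix)
            if n > best_len and tuple(tokens[:n]) == prefix:
                best_len, best_kv = n, kv
        return best_len, best_kv

    def put(self, tokens: List[int], kv: KVCache) -> None:
        self._store[tuple(tokens)] = kv


def prefill(tokens: List[int], past_kv: Optional[KVCache]) -> KVCache:
    """Stand-in for the real prefill: extends the cache with only the new tokens."""
    return (past_kv or "") + "".join(f"[{t}]" for t in tokens)


def answer(prompt_tokens: List[int], store: PrefixKVCacheStore) -> KVCache:
    # Reuse whatever prefix is already cached (system prompt, earlier turns,
    # shared workflow preamble) and prefill only the uncached suffix.
    matched, kv = store.longest_prefix(prompt_tokens)
    kv = prefill(prompt_tokens[matched:], kv)
    store.put(prompt_tokens, kv)
    return kv


store = PrefixKVCacheStore()
turn_1 = [1, 2, 3, 4]        # system prompt + first user message
answer(turn_1, store)        # full prefill; cache populated
turn_2 = turn_1 + [5, 6]     # second turn extends the same prefix
answer(turn_2, store)        # only tokens [5, 6] are prefilled
```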

Experiment results