Achieve 23x LLM Inference Throughput & Reduce p50 Latency

https://www.anyscale.com/blog/continuous-batching-llm-inference

Due to the large GPU memory footprint and compute cost of LLMs, serving dominates the compute cost for most real-world applications. ML engineers often treat LLMs like "black boxes" that can only be optimized with internal changes such as quantization and custom CUDA kernels. However, this is not entirely the case. Because LLMs generate their output iteratively, and because LLM inference is often memory-bound rather than compute-bound, there are surprising system-level batching optimizations that can make a 10x or greater difference in real-world workloads.

One such recently proposed optimization is continuous batching, also known as dynamic batching, or batching with iteration-level scheduling. We wanted to see how this optimization performs. We get into the details below, including how we simulate a production workload, but to summarize our findings:

Up to 23x throughput improvement using continuous batching and continuous batching-specific memory optimizations (using vLLM).
8x throughput over naive batching by using continuous batching (both on Ray Serve and Hugging Face’s text-generation-inference).
4x throughput over naive batching by using an optimized model implementation (NVIDIA’s FasterTransformer).

You can try out continuous batching today: see this example to run vLLM on Ray Serve.

The remainder of this blog is structured as follows:

We’ll cover the basics of how LLM inference works and highlight inefficiencies in traditional request-based dynamic batching policies.
We’ll introduce continuous batching and how it answers many of the inefficiencies of request-based dynamic batching.
We then discuss our benchmarks and what they imply for serving LLMs cost-effectively.

The basics of LLM inference

There is a lot to know about LLM inference, and we refer users to Efficient Inference on a Single GPU and Optimization story: Bloom inference for more detail. However, at a high level, LLM inference is pretty straightforward.

For each request:

You start with a sequence of tokens (called the "prefix" or "prompt").
The LLM produces a sequence of completion tokens, stopping only after producing a stop token or reaching a maximum sequence length.

This is an iterative process: each forward pass of the model yields one additional completion token. For example, if you prompt with the sentence "What is the capital of California: ", it would take ten forward passes to get back the full response of ["S", "a", "c", "r", "a", "m", "e", "n", "t", "o"]. This example simplifies things a little because real tokens do not map 1:1 to ASCII characters (a popular token encoding technique is Byte-Pair Encoding, which is beyond the scope of this blog post), but the iterative nature of generation is the same regardless of how you tokenize your sequences.

Simplified LLM inference. This toy example shows a hypothetical model which supports a maximum sequence length of 8 tokens (T1, T2, …, T8). Starting from the prompt tokens (yellow), the iterative process generates a single token at a time (blue). Once the model generates an end-of-sequence token (red), the generation loop stops. This example shows a batch of only one input sequence, so the batch size is 1.
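
The generation loop described above can be sketched in a few lines of Python. This is a minimal illustration, not how any real serving framework implements it: the names generate, ToyModel, and EOS are hypothetical, the "model" simply replays a fixed answer, and tokens are single characters to match the simplified Sacramento example.

```python
from typing import Callable, List

EOS = "<eos>"  # hypothetical end-of-sequence (stop) token


def generate(prompt: List[str],
             forward: Callable[[List[str]], str],
             max_seq_len: int) -> List[str]:
    """Run the iterative decode loop for a single sequence (batch size 1)."""
    tokens = list(prompt)
    completion: List[str] = []
    # Stop on the stop token or when the maximum sequence length is reached.
    while len(tokens) < max_seq_len:
        next_token = forward(tokens)  # one forward pass yields one new token
        tokens.append(next_token)
        if next_token == EOS:
            break
        completion.append(next_token)
    return completion


class ToyModel:
    """Stand-in for an LLM (an assumption for illustration only).

    It replays a fixed answer one token per call, then emits EOS.
    """

    def __init__(self, prompt_len: int, answer: List[str]) -> None:
        self.prompt_len = prompt_len
        self.answer = answer
        self.forward_passes = 0

    def __call__(self, tokens: List[str]) -> str:
        self.forward_passes += 1
        i = len(tokens) - self.prompt_len  # completion tokens produced so far
        return self.answer[i] if i < len(self.answer) else EOS


# Character-level "tokens", as in the simplified example in the text.
prompt = list("What is the capital of California: ")
model = ToyModel(len(prompt), list("Sacramento"))
completion = generate(prompt, model, max_seq_len=64)
print("".join(completion), model.forward_passes)
```

In this sketch the ten answer tokens take ten forward passes, and one further pass produces the stop token that ends the loop. Lowering max_seq_len truncates the completion instead, which is the second stopping condition listed above.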