https://medium.com/byte-sized-ai/inference-optimizations-1-continuous-batching-03408c673098
Published in Byte-Sized AI · Aug 24, 2024 · 6 min read
Batching for large language models (LLMs) is challenging because of the iterative nature of their inference. Requests within a batch can complete at different times, which makes it hard to release their resources promptly and to admit new requests that are at different stages of generation. As a result, the GPU can sit underutilized whenever generation lengths within a batch diverge: every sequence holds its slot until the longest one finishes. This inefficiency is shown in the figure below by the white squares that follow the end-of-sequence tokens of sequences 1, 3, and 4, which represent unused GPU capacity.
Figure from [2].
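To make the waste concrete, here is a minimal sketch of static (request-level) batching. It is not code from the article or any particular serving engine: `decode_step`, `EOS`, and `max_new_tokens` are hypothetical placeholders standing in for one forward pass of a decoder and its stopping conditions.

```python
import random

EOS = -1  # hypothetical end-of-sequence token id

def decode_step(batch):
    """Placeholder: one forward pass producing one token per sequence in the batch."""
    return [EOS if random.random() < 0.1 else random.randint(0, 100) for _ in batch]

def static_batching(prompts, max_new_tokens=32):
    outputs = [[] for _ in prompts]
    finished = [False] * len(prompts)
    wasted = 0
    # The batch is held together until *every* sequence is done, so sequences
    # that finish early keep occupying their slot (the white squares above).
    while not all(finished):
        for i, tok in enumerate(decode_step(prompts)):
            if finished[i]:
                wasted += 1  # slot is still computed but contributes nothing useful
                continue
            outputs[i].append(tok)
            if tok == EOS or len(outputs[i]) >= max_new_tokens:
                finished[i] = True
    return outputs, wasted

outs, wasted = static_batching([[1, 2, 3]] * 4)
print(f"wasted slot-iterations: {wasted}")
```

The `wasted` counter corresponds to the white squares in the figure: decode iterations spent on slots whose sequence has already emitted its end-of-sequence token.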
ORCA, which introduces the concept of continuous batching, addresses these challenges through iteration-level scheduling and selective batching.
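The sketch below illustrates the iteration-level scheduling idea only; it is not ORCA's implementation. The scheduler admits waiting requests and evicts finished ones between every decode iteration, so a freed slot never sits idle. `decode_step`, `EOS`, `MAX_BATCH`, and `MAX_TOKENS` are illustrative assumptions, and selective batching (how attention and feed-forward ops are grouped inside the step) is not modeled here.

```python
from collections import deque
from dataclasses import dataclass, field
import random

EOS = -1          # hypothetical end-of-sequence token id
MAX_BATCH = 4     # hypothetical slot budget (e.g., bounded by KV-cache memory)
MAX_TOKENS = 32   # hypothetical per-request generation cap

@dataclass
class Request:
    prompt: list[int]
    output: list[int] = field(default_factory=list)

def decode_step(batch):
    """Placeholder for one forward pass that emits one token per running request."""
    return [EOS if random.random() < 0.1 else random.randint(0, 100) for _ in batch]

def continuous_batching(waiting: deque):
    running = []
    while waiting or running:
        # Iteration-level scheduling: admit waiting requests the moment a slot
        # frees up, instead of waiting for the whole batch to drain.
        while waiting and len(running) < MAX_BATCH:
            running.append(waiting.popleft())

        # Run exactly one decode iteration for the current batch.
        for req, tok in zip(running, decode_step(running)):
            req.output.append(tok)

        # Evict finished requests immediately, freeing their slot for new work
        # (no white squares after an end-of-sequence token). A real engine
        # would also release the evicted request's KV cache here.
        running = [r for r in running
                   if r.output[-1] != EOS and len(r.output) < MAX_TOKENS]

if __name__ == "__main__":
    continuous_batching(deque(Request(prompt=[1, 2, 3]) for _ in range(8)))
```

Because admission and eviction happen at iteration granularity rather than batch granularity, short requests return as soon as they finish and new requests start decoding without waiting for an entire batch to complete.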