Prefill Is Compute-Bound. Decode Is Memory-Bound. Why Your GPU Shouldn’t Do Both. | Towards Data Science

Created 5/31/2026 at 5:04:53 AM. Edited 5/31/2026 at 5:11:42 AM.

Prefill processes all input tokens in parallel. For a 4,096-token prompt, the attention computation involves large matrix multiplications across the full sequence length. This is compute-bound work: the GPU’s tensor cores are the bottleneck. On an H100 SXM, prefill achieves 200-400 arithmetic operations per byte of memory accessed, and GPU utilization sits between 90% and 95%. The memory bandwidth, at 3.35 TB/s, is barely taxed.

Decode generates one token at a time. Each step reads the entire KV-cache from HBM to compute a single attention output, so the tensor cores finish in microseconds and then wait for the next memory read. Arithmetic intensity drops to 60-80 ops/byte and GPU utilization falls to 20-40%. The tensor cores sit idle while the memory bus saturates.
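The gap between the two phases can be checked with a simple roofline estimate. The sketch below is illustrative only: the 3.35 TB/s bandwidth comes from the text, while the peak compute figure (roughly 989 TFLOP/s dense FP16 on an H100 SXM) is an assumption, as are the function names and the sample intensities of 400 and 70 ops/byte picked from the ranges above.

```python
# Roofline sketch: attainable throughput = min(peak compute, intensity * bandwidth).
PEAK_FLOPS = 989e12      # ASSUMPTION: approximate H100 SXM dense FP16 tensor-core peak (FLOP/s)
MEM_BANDWIDTH = 3.35e12  # from the text: 3.35 TB/s HBM bandwidth (bytes/s)


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity (ops/byte) where the compute and memory limits meet."""
    return peak_flops / bandwidth


def attainable_flops(intensity: float,
                     peak_flops: float = PEAK_FLOPS,
                     bandwidth: float = MEM_BANDWIDTH) -> float:
    """Upper bound on throughput for a kernel with the given ops/byte."""
    return min(peak_flops, intensity * bandwidth)


def regime(intensity: float) -> str:
    """Classify a kernel by which side of the ridge point it falls on."""
    if intensity >= ridge_point(PEAK_FLOPS, MEM_BANDWIDTH):
        return "compute-bound"
    return "memory-bound"


if __name__ == "__main__":
    # Sample points: 400 ops/byte (top of the prefill range), 70 (middle of the decode range).
    for label, intensity in [("prefill", 400.0), ("decode", 70.0)]:
        fraction = attainable_flops(intensity) / PEAK_FLOPS
        print(f"{label}: {regime(intensity)}, at most {fraction:.0%} of peak compute")
```

Under these assumptions the ridge point is about 295 ops/byte, so decode at 60-80 ops/byte can reach only about a fifth to a quarter of peak compute, which is in line with the utilization figures quoted above. The exact crossover shifts with the peak figure and numeric precision chosen.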

Disaggregated inference runs prefill and decode on separate GPU pools connected by a fast network. An incoming request is routed to a prefill worker, which processes the full prompt and generates the KV-cache. That cache is then transferred over the network to a decode worker, which handles the autoregressive token generation.
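The request flow just described can be sketched as a toy simulation. Everything here is hypothetical scaffolding rather than any real serving framework's API: the class names, the round-robin routing policy, and the `KVCache` stand-in (which tracks only a token count instead of real tensors) are assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class KVCache:
    """Stand-in for the per-layer key/value tensors; tracks only the token count."""
    request_id: int
    num_tokens: int


class PrefillWorker:
    def __init__(self, name: str):
        self.name = name

    def prefill(self, request_id: int, prompt_tokens: list) -> KVCache:
        # Processes the whole prompt in one pass and produces its KV-cache.
        return KVCache(request_id, len(prompt_tokens))


class DecodeWorker:
    def __init__(self, name: str):
        self.name = name

    def decode(self, cache: KVCache, max_new_tokens: int) -> list:
        generated = []
        for _ in range(max_new_tokens):
            # Each step would attend over the full cache, then append one entry.
            cache.num_tokens += 1
            generated.append(f"tok{cache.num_tokens}")
        return generated


class Router:
    """Hypothetical round-robin router over the two pools."""

    def __init__(self, prefill_pool: list, decode_pool: list):
        self._prefill = cycle(prefill_pool)
        self._decode = cycle(decode_pool)
        self._next_id = 0

    def handle(self, prompt_tokens: list, max_new_tokens: int):
        request_id = self._next_id
        self._next_id += 1
        prefill_worker = next(self._prefill)
        cache = prefill_worker.prefill(request_id, prompt_tokens)
        # In a real system the cache would cross the network at this point.
        decode_worker = next(self._decode)
        tokens = decode_worker.decode(cache, max_new_tokens)
        return prefill_worker.name, decode_worker.name, tokens
```

A call such as `Router([PrefillWorker("p0")], [DecodeWorker("d0"), DecodeWorker("d1")]).handle(list(range(4)), 3)` walks one request through both pools. A production router would also weigh queue depth and cache locality, which this sketch ignores.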

Disaggregation is not free. The KV-cache produced during prefill has to move from the prefill GPU to the decode GPU over the network, and these caches are not small.

Consider a 70B-parameter model using grouped-query attention with 80 layers, 8 KV heads per layer, and 128 dimensions per head, stored in FP16. Each token stores one key and one value vector per head per layer, so its KV state is 2 × 80 × 8 × 128 × 2 bytes = 327,680 bytes. A 4,096-token prompt produces 1.34 GB of KV-cache. That entire block has to transfer before the decode worker can begin generating.
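The arithmetic above is easy to reproduce, and extending it gives a rough feel for the transfer cost. In this sketch the model dimensions come from the text, while the function names and the 400 Gb/s (50 GB/s) link speed are assumptions chosen purely for illustration. The estimate ignores protocol overhead and any overlap of transfer with compute.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV-cache per token: one key and one value vector per head per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_bytes(num_tokens: int, per_token_bytes: int) -> int:
    """Total KV-cache size for a prompt of num_tokens tokens."""
    return num_tokens * per_token_bytes


def transfer_seconds(num_bytes: int, link_bytes_per_second: float) -> float:
    """Idealized transfer time: size divided by raw link bandwidth."""
    return num_bytes / link_bytes_per_second


# Model figures from the text: 80 layers, 8 KV heads, 128 dims per head, FP16 (2 bytes).
PER_TOKEN = kv_bytes_per_token(80, 8, 128, 2)   # 327,680 bytes
PROMPT_CACHE = kv_cache_bytes(4096, PER_TOKEN)  # 1,342,177,280 bytes, about 1.34 GB

# ASSUMPTION: a hypothetical 400 Gb/s link, i.e. 50e9 bytes per second.
ASSUMED_LINK_BYTES_PER_SECOND = 50e9

if __name__ == "__main__":
    seconds = transfer_seconds(PROMPT_CACHE, ASSUMED_LINK_BYTES_PER_SECOND)
    print(f"per token: {PER_TOKEN:,} bytes")
    print(f"4,096-token prompt: {PROMPT_CACHE / 1e9:.2f} GB")
    print(f"idealized transfer on the assumed link: {seconds * 1e3:.1f} ms")
```

On the assumed link the idealized transfer takes roughly 27 ms. A slower interconnect or a longer prompt scales that figure linearly.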
