PagedAttention vs Traditional KV Cache: How vLLM Reinvented GPU Memory
Author: Nino, Senior Tech Editor
Every token you generate during Large Language Model (LLM) inference silently consumes GPU memory. In high-performance serving, memory is often a tighter bottleneck than compute. With traditional Key-Value (KV) caching, a significant portion of that memory is wasted: it is reserved but never used, it is not reclaimed while the request runs, and it ultimately prevents larger batch sizes.
The introduction of vLLM and its core innovation, PagedAttention, changed the landscape of AI infrastructure by borrowing a decades-old idea from operating systems. At n1n.ai, we see developers constantly struggling to balance latency and throughput, and understanding PagedAttention is the first step toward optimizing your deployment. In this guide, we will break down exactly how it works and why the vLLM team reported up to 24× higher throughput than Hugging Face Transformers.
What Is a KV Cache and Why Does It Exist?
To understand the solution, we must first understand the problem. LLMs like Llama 3 or DeepSeek-V3 are autoregressive. This means that to generate the n-th token, the model needs the context of all previous tokens.
Without caching, the model would have to recompute the Key and Value tensors for every single preceding token at every single step. Over a sequence of n tokens, that redundant work grows as O(n²), which makes real-time generation impractical. To solve this, we use a KV cache: we store the previously computed Key and Value tensors in GPU VRAM. Subsequent steps only compute the KV for the new token and attend to the cached values.
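To make the saving concrete, the sketch below counts how many per-token Key/Value projections a decoder performs with and without a cache. It is a counting model only: the function names are illustrative and no real model is involved.

```python
def kv_projections_without_cache(num_tokens: int) -> int:
    """Each step t recomputes K/V for all t tokens seen so far."""
    return sum(range(1, num_tokens + 1))


def kv_projections_with_cache(num_tokens: int) -> int:
    """Each step computes K/V only for the newest token."""
    return num_tokens


if __name__ == "__main__":
    for n in (16, 256, 2048):
        naive = kv_projections_without_cache(n)
        cached = kv_projections_with_cache(n)
        print(f"n={n:>5}: without cache={naive:>9,}  with cache={cached:>5,}")
```

The uncached count is n(n+1)/2, so the gap widens quickly as sequences get longer. The cache trades that compute for memory.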
However, this cache is massive. Let's look at the math for a typical Llama-2 7B model:
# Rough KV cache size estimation (Llama-2 7B shapes)
num_layers = 32
num_heads = 32
head_dim = 128
seq_len = 2048
batch_size = 8
dtype_bytes = 2  # float16

# The factor of 2 accounts for storing both Keys and Values
kv_cache_bytes = (
    2 * num_layers * num_heads * head_dim * seq_len * batch_size * dtype_bytes
)
print(f"KV cache: {kv_cache_bytes / 1e9:.2f} GB")
# Output: KV cache: 8.59 GB
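As a rough follow-up under the same assumptions (32 layers, 32 heads, head dimension 128, float16), the estimate can be turned into a per-token cost and a capacity budget. The 20 GB budget and the helper names below are hypothetical, chosen only for illustration.

```python
def kv_bytes_per_token(num_layers: int, num_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers (K and V)."""
    return 2 * num_layers * num_heads * head_dim * dtype_bytes


def max_cached_tokens(budget_bytes: int, bytes_per_token: int) -> int:
    """How many tokens fit in a given KV cache budget."""
    return budget_bytes // bytes_per_token


per_token = kv_bytes_per_token(num_layers=32, num_heads=32, head_dim=128)
budget = 20 * 10**9  # hypothetical 20 GB left for the KV cache after weights
tokens = max_cached_tokens(budget, per_token)

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"Tokens that fit in budget: {tokens:,}")
print(f"Full 2048-token sequences: {tokens // 2048}")
```

Thinking in bytes per token makes it easier to see that the limit is on total cached tokens, not on the number of requests as such.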
For just 8 users at a 2k context window, you've already eaten 8.6 GB of VRAM. If you are using a premier API aggregator like n1n.ai, these optimizations are handled behind the scenes, but for self-hosted infrastructure, this is where the trouble begins.
The Problem: Traditional KV Cache and Memory Fragmentation
In traditional serving systems (the vLLM paper points to FasterTransformer and Orca as examples), each request's KV cache is allocated as one contiguous block of memory sized for the longest sequence it might produce. If you set a max_seq_len of 2048, the system reserves space for 2048 tokens the moment the request starts.
This leads to three types of fragmentation:
- Internal Fragmentation: If a user's request only generates 50 tokens but you reserved 2048, the remaining 1998 slots are wasted.
- Reservation Fragmentation: Even if the model might eventually use the space, it sits idle during the early stages of generation.
- External Fragmentation: Because allocations are contiguous and of varying lengths, the GPU memory becomes a "Swiss cheese" of small, unusable gaps.
In practice, this means your GPU might report 80% memory utilization, but only 30-40% of that memory is actually holding useful data. This inefficiency prevents you from increasing your batch size, leading to long queues and high costs.
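A small calculation illustrates how contiguous pre-allocation wastes memory. The request lengths below are made up for illustration, and the function is a sketch, not a measurement of any real framework.

```python
from typing import List


def preallocated_utilization(actual_lengths: List[int], max_seq_len: int) -> float:
    """Fraction of reserved KV slots that actually hold tokens."""
    reserved = max_seq_len * len(actual_lengths)
    used = sum(min(n, max_seq_len) for n in actual_lengths)
    return used / reserved


# Hypothetical final lengths (prompt + generated tokens) for 8 requests
lengths = [50, 120, 300, 512, 700, 900, 1500, 2048]
util = preallocated_utilization(lengths, max_seq_len=2048)
print(f"Useful slots: {util:.1%} of reserved memory")
```

With this made-up mix of short and long requests, well under half of the reserved slots ever hold a token, and that is before external fragmentation is counted.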
Inspiration from OS Virtual Memory
The vLLM team (Kwon et al., 2023) realized that LLM memory management in 2023 looked a lot like computer RAM management in the 1960s. The solution was the same: Paging.
In an Operating System, physical RAM is divided into fixed-size frames. A process's virtual address space is divided into pages. The OS maintains a Page Table to map virtual pages to physical frames. These frames do not need to be contiguous.
vLLM applies this to the KV cache. Instead of one giant block, the KV cache is divided into Blocks (e.g., 16 tokens per block).
PagedAttention: How It Works
PagedAttention allows the KV tensors to be stored in non-contiguous physical memory. The algorithm works as follows:
- Logical Blocks: The KV cache for a sequence is partitioned into logical blocks.
- Block Table: A mapping identifies where each logical block resides in the physical GPU memory.
- On-Demand Allocation: As the model generates tokens, it only allocates a new physical block when the current one is full.
| Concept | OS Virtual Memory | PagedAttention |
|---|---|---|
| Memory Unit | Page | KV Block |
| Address Space | Virtual Address | Token Index |
| Mapping | Page Table | Block Table |
| Storage | Physical RAM | GPU VRAM |
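The block-table lookup can be sketched as a simple address translation. The block table contents below are invented for illustration, and the block size of 16 follows the example in the text.

```python
from typing import List, Tuple

BLOCK_SIZE = 16  # tokens per block, matching the example in the text


def translate(token_index: int, block_table: List[int]) -> Tuple[int, int]:
    """Map a token's logical position to (physical_block_id, offset_in_block)."""
    logical_block, offset = divmod(token_index, BLOCK_SIZE)
    if logical_block >= len(block_table):
        raise IndexError(f"token {token_index} has no allocated block")
    return block_table[logical_block], offset


def blocks_needed(num_tokens: int) -> int:
    """Blocks required to hold num_tokens (ceiling division)."""
    return -(-num_tokens // BLOCK_SIZE)


# Hypothetical block table: logical blocks 0, 1, 2 live in scattered physical blocks
block_table = [7, 2, 11]

print(translate(0, block_table))
print(translate(17, block_table))
print(translate(40, block_table))
print(blocks_needed(41))
```

Because only the last block of a sequence can be partially filled, the waste per sequence is bounded by BLOCK_SIZE - 1 token slots, regardless of how long the sequence grows.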
Implementation Logic
Here is a simplified conceptual look at how a PagedKVManager handles allocation without requiring contiguous space:
from dataclasses import dataclass
from typing import Dict, List, Optional

BLOCK_SIZE = 16  # tokens per block


@dataclass
class KVBlock:
    block_id: int
    token_count: int = 0  # tokens currently stored in this block


class PagedKVManager:
    def __init__(self, total_blocks: int):
        # Pool of physical block IDs not yet assigned to any sequence
        self.free_blocks: List[int] = list(range(total_blocks))
        self.blocks: Dict[int, KVBlock] = {
            i: KVBlock(block_id=i) for i in range(total_blocks)
        }

    def allocate_block(self) -> Optional[int]:
        """Take any free block; return None if the pool is exhausted."""
        if not self.free_blocks:
            return None
        return self.free_blocks.pop()

    def append_token(self, seq_block_table: List[int]) -> bool:
        """Record one new token for a sequence, allocating a block on demand."""
        # If no blocks exist yet or the last block is full, allocate a new one
        if (not seq_block_table
                or self.blocks[seq_block_table[-1]].token_count == BLOCK_SIZE):
            new_id = self.allocate_block()
            if new_id is None:
                return False  # out of KV cache memory
            seq_block_table.append(new_id)
        self.blocks[seq_block_table[-1]].token_count += 1
        return True
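The manager above never returns blocks to the pool. A hypothetical extension, written here as a self-contained sketch and not as vLLM's actual allocator, shows how freeing a finished sequence lets another sequence reuse its blocks even though they are not adjacent. All class and method names are illustrative.

```python
from typing import Dict, List

BLOCK_SIZE = 16  # tokens per block


class SimpleBlockPool:
    """Toy free-list allocator; names are illustrative, not vLLM's API."""

    def __init__(self, total_blocks: int):
        self.free_blocks: List[int] = list(range(total_blocks))
        self.fill: Dict[int, int] = {}  # block_id -> tokens stored

    def append_token(self, block_table: List[int]) -> bool:
        """Store one more token, allocating a block only when needed."""
        if not block_table or self.fill[block_table[-1]] == BLOCK_SIZE:
            if not self.free_blocks:
                return False  # out of memory: caller must wait or preempt
            new_id = self.free_blocks.pop()
            self.fill[new_id] = 0
            block_table.append(new_id)
        self.fill[block_table[-1]] += 1
        return True

    def free_sequence(self, block_table: List[int]) -> None:
        """Return every block of a finished sequence to the pool."""
        for block_id in block_table:
            del self.fill[block_id]
            self.free_blocks.append(block_id)
        block_table.clear()


pool = SimpleBlockPool(total_blocks=4)
seq_a: List[int] = []
seq_b: List[int] = []

for _ in range(20):            # 20 tokens -> 2 blocks
    pool.append_token(seq_a)
for _ in range(30):            # 30 tokens -> 2 blocks
    pool.append_token(seq_b)

print("A:", seq_a, "B:", seq_b, "free:", pool.free_blocks)

pool.free_sequence(seq_a)      # A finishes; its blocks become reusable
seq_c: List[int] = []
for _ in range(10):
    pool.append_token(seq_c)

print("C:", seq_c, "free:", pool.free_blocks)
```

Any free block is as good as any other, so the allocator never has to search for a contiguous gap. That is what removes external fragmentation.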
Throughput Gains: The Numbers
By eliminating almost all fragmentation, vLLM can fit many more sequences into a single batch. In their original benchmarks, vLLM achieved:
- 2-4× higher throughput than FasterTransformer and Orca at comparable latency, as reported in the paper (Kwon et al., 2023).
- Up to 24× higher throughput than Hugging Face Transformers, as reported in the project's launch benchmarks.
When you use n1n.ai, you benefit from these high-throughput architectures. Whether you are calling Claude 3.5 Sonnet or OpenAI o3, the serving infrastructure behind an API may rely on similar paging principles to keep latency low, although providers do not always disclose their serving stacks.
The Hidden Superpower: Prefix Caching
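PagedAttention enables another major optimization: memory sharing with Copy-on-Write (CoW). If you have 100 requests all using the same long system prompt (e.g., a complex RAG context), all 100 block tables can point to the same physical blocks for that prompt, with a reference count tracking how many sequences share each block.
Memory is only copied when a sequence needs to write to a block it shares with others, as happens when candidates diverge in beam search or parallel sampling. The shared prefix is stored once instead of once per request, so the extra memory each additional request needs for that prefix is close to zero, which matters for AI agents that reuse long prompts.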
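The sharing mechanism can be sketched with reference counts. This is a toy model under stated assumptions: the class and method names are hypothetical, the copy of the KV data itself is left as a comment, and vLLM's real implementation differs in detail.

```python
from typing import Dict, List


class SharedBlockPool:
    """Toy reference-counted block pool; an illustrative sketch, not vLLM's API."""

    def __init__(self, total_blocks: int):
        self.free_blocks: List[int] = list(range(total_blocks))
        self.ref_count: Dict[int, int] = {}

    def allocate(self) -> int:
        block_id = self.free_blocks.pop()
        self.ref_count[block_id] = 1
        return block_id

    def fork(self, block_table: List[int]) -> List[int]:
        """Share all blocks with a new sequence; no KV data is copied."""
        for block_id in block_table:
            self.ref_count[block_id] += 1
        return list(block_table)

    def writable_last_block(self, block_table: List[int]) -> int:
        """Return a block the sequence may write to, copying only if shared."""
        last = block_table[-1]
        if self.ref_count[last] == 1:
            return last                    # sole owner: write in place
        self.ref_count[last] -= 1          # drop our reference to the shared block
        private = self.allocate()          # a real system would copy the KV data here
        block_table[-1] = private
        return private


pool = SharedBlockPool(total_blocks=8)
prompt_blocks = [pool.allocate() for _ in range(3)]   # shared system prompt

request_a = pool.fork(prompt_blocks)
request_b = pool.fork(prompt_blocks)

pool.writable_last_block(request_a)  # A diverges: only its last block is copied

print("prompt:", prompt_blocks)
print("A:", request_a)
print("B:", request_b)
print("blocks in use:", len(pool.ref_count))
```

In this example three block tables of three blocks each occupy four physical blocks in total, because only the block that request A writes to is duplicated.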
Trade-offs and Considerations
While PagedAttention is revolutionary, it isn't a free lunch:
- Kernel Complexity: Attention kernels traditionally assume that a sequence's KV tensors are contiguous in memory. vLLM had to write custom CUDA kernels that follow the block table to perform attention across non-contiguous blocks.
- Overhead: There is a small CPU-side overhead for managing the block tables, though this is usually negligible compared to the GPU gains.
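- Capacity-Bound vs. Not: PagedAttention helps most when KV cache capacity is what limits your batch size (many concurrent users, variable output lengths). If you serve only a few requests at a time and their caches already fit comfortably, there is little wasted memory to reclaim and the gains will be smaller.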
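To see why paged storage needs its own kernel logic, consider what an attention kernel has to do when keys are scattered across blocks. The pure-Python sketch below gathers key vectors through a block table before computing dot-product scores. It stands in for work a custom GPU kernel would do, and uses a tiny block size and made-up vectors for readability.

```python
from typing import Dict, List

BLOCK_SIZE = 2  # tiny block size so the example stays readable

Vector = List[float]


def gather_keys(pool: Dict[int, List[Vector]], block_table: List[int],
                seq_len: int) -> List[Vector]:
    """Collect the first seq_len key vectors by following the block table."""
    keys: List[Vector] = []
    for position in range(seq_len):
        logical_block, offset = divmod(position, BLOCK_SIZE)
        physical_block = block_table[logical_block]   # extra indirection per block
        keys.append(pool[physical_block][offset])
    return keys


def attention_scores(query: Vector, keys: List[Vector]) -> List[float]:
    """Unnormalized dot-product scores between one query and every key."""
    return [sum(q * k for q, k in zip(query, key)) for key in keys]


# Hypothetical physical pool: block_id -> key vectors stored in that block
pool = {
    5: [[1.0, 0.0], [0.0, 1.0]],
    1: [[1.0, 1.0], [2.0, 0.0]],
    8: [[0.0, 3.0], [0.0, 0.0]],   # second slot not yet filled
}
block_table = [5, 1, 8]            # logical order, scattered physical blocks

keys = gather_keys(pool, block_table, seq_len=5)
print(attention_scores([1.0, 2.0], keys))
```

The block-table lookup inside the loop is the indirection that a kernel written for contiguous tensors does not perform, which is why paged storage required dedicated kernel work.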
Summary and Key Takeaways
- Traditional KV Cache is wasteful because it requires contiguous pre-allocation.
- PagedAttention treats GPU memory like virtual memory, dividing it into flexible blocks.
- The share of KV cache memory holding useful data rises from roughly 30-40% under contiguous pre-allocation to over 95%, allowing for much larger batches.
- Throughput increases significantly, reducing the cost per token for providers.
For developers who want to avoid the headache of managing these low-level optimizations, n1n.ai provides a unified API that aggregates the fastest, most efficient models on the market.