
Context Windows Explained: Why Your LLM’s “Memory” Costs More Than You Think

Takeaways

Every LLM provider now brags about context window size like it’s a horsepower number. Claude sits at 1M tokens, Gemini 3 Pro supports 2M, Llama 4 Scout claims 10M. The marketing implies bigger is better — more context, more intelligence, more capability. The engineering reality is more complicated and a lot more expensive. Context length isn’t free. It’s not even cheap. At the scales modern models now advertise, it’s quietly reshaping the economics of AI serving in ways most product decisions still ignore.

This article walks through what an LLM actually does when you hand it a prompt, what a context window is underneath the marketing, how its length drags on compute and memory, and where short, medium, and long context each make sense — with the hardware you’d need to run them.

Posted: 4/19/2026
Author: Gal Ratner

How an LLM Actually Works

Strip away the chatbot UI and an LLM is doing something surprisingly mechanical.

Tokenization. Text goes in. The first step breaks it into tokens — sub-word units the model was trained to recognize. A token averages roughly four characters or three-quarters of a word in English. Code tokenizes less efficiently, closer to two tokens per word, because identifiers, operators, and whitespace are treated differently. CJK languages consume 2–8x more tokens than English for equivalent content. Different models use different tokenizers, so 1,000 tokens of GPT-5 input is not the same string as 1,000 tokens of Claude input.
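The rules of thumb above can be turned into a rough estimator. This is a pure-Python sketch using the heuristics just described (four characters per English token, roughly two tokens per word of code); the `estimate_tokens` helper is illustrative, and exact counts always require the target model's own tokenizer:

```python
def estimate_tokens(text: str, kind: str = "english") -> int:
    """Rough token-count estimate from the rules of thumb above.

    'english' prose: ~4 characters per token.
    'code': ~2 tokens per whitespace-separated word.
    Real counts require the target model's own tokenizer.
    """
    if kind == "code":
        return 2 * len(text.split())
    return max(1, round(len(text) / 4))

prompt = "Summarize the attached meeting transcript in five bullet points."
prose_estimate = estimate_tokens(prompt)
code_estimate = estimate_tokens("for i in range(10): total += i", kind="code")
```

Estimates like this are fine for budgeting a context window; they are not fine for billing math, precisely because tokenizers differ between model families.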

Embedding. Each token is converted into a high-dimensional vector — typically 4,096 to 16,384 dimensions in modern frontier models. These are the numbers the neural network actually operates on.

Transformer layers. The embeddings flow through dozens or hundreds of transformer layers. Each layer has two main components: a self-attention block and a feed-forward network. The self-attention block is the centerpiece. For each token, the model computes a query, a key, and a value — three projections of that token’s embedding. To decide what that token should “attend to,” the model takes the dot product of its query with the keys of every other token in the context. The resulting attention scores pick which values get blended into the token’s new representation. This is how a transformer knows that “it” in a sentence refers to “the dog” six words earlier. This is also where the cost blows up, as we’ll see.
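The query/key/value mechanics above can be sketched in a few lines. This is a toy single-head scaled dot-product attention with made-up two-dimensional vectors, not any real model's projection math:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Scores the query against every key, softmaxes the scores into
    weights, and returns the weighted blend of the value vectors.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# Three tokens already in context; the newest token's query attends
# over all three keys, including its own.
keys   = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = attend([1.0, 1.0], keys, values)
```

The dot product against every key is the line to stare at: it runs once per token already in the context, which is where the quadratic cost discussed later comes from.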

Autoregressive generation. LLMs generate one token at a time. The model computes a probability distribution over the entire vocabulary for the next token, samples (or greedy-picks) one, appends it to the sequence, and runs another forward pass for the token after that. Every generated token triggers a fresh computation across every layer.
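A minimal sketch of that loop, with a hard-coded bigram table standing in for the network. This is only the append-and-rerun shape of generation; a real model recomputes the distribution from the entire context on every step, not just the last token:

```python
# Stand-in "model": maps the last token to a next-token distribution.
# A real LLM computes this distribution with a full forward pass over
# every token in the context.
BIGRAMS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "</s>": 0.3},
    "dog": {"sat": 0.7, "</s>": 0.3},
    "sat": {"</s>": 1.0},
}

def greedy_decode(max_tokens: int = 10) -> list[str]:
    tokens = ["<s>"]
    for _ in range(max_tokens):
        dist = BIGRAMS[tokens[-1]]
        # Greedy pick: take the highest-probability next token.
        # Sampling would draw from `dist` instead.
        next_tok = max(dist, key=dist.get)
        tokens.append(next_tok)
        if next_tok == "</s>":  # stop token ends generation
            break
    return tokens
```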

The part that matters: to generate a single token, the model has to look at every token already in the context.

What the Context Window Actually Is

The context window is the maximum number of tokens the model can process in a single forward pass — input plus whatever the model has generated so far in its response. It’s the hard ceiling on working memory. Send more than the limit and the earliest tokens get dropped, or the request gets rejected outright.
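The drop-the-earliest behavior can be sketched as a sliding window. `fit_to_window` is a hypothetical helper showing one common truncation policy, not any particular provider's API (many stacks reject the oversized request instead):

```python
def fit_to_window(tokens: list[str], limit: int, reserve: int) -> list[str]:
    """Keep only the most recent tokens, leaving `reserve` slots in the
    window for the model's response."""
    budget = limit - reserve
    return tokens[-budget:] if len(tokens) > budget else tokens
```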

But the “working memory” metaphor misses something important. What actually sits in GPU memory during inference isn’t really the tokens — it’s the KV cache.

Recall that self-attention computes keys and values for every token. For autoregressive generation, those keys and values don’t change once computed — they’re deterministic projections of tokens already in the sequence. So rather than recompute them at every generation step, inference servers cache them. The query for the new token gets freshly computed; the keys and values for everything before it get pulled from cache.

This cache is the single biggest memory cost in LLM inference at long context. The formula is simple: KV cache size per token = 2 × num_layers × num_kv_heads × head_dim × precision_in_bytes. The factor of 2 is for keys and values; num_kv_heads is the key/value head count, which grouped-query attention (more on that below) makes much smaller than the query head count. For Llama 3 70B — 80 layers, 8 KV heads, head dimension 128 — this works out to roughly 40 GB of KV cache for a single 128K-token context at FP16 precision — and that's per user. Multiply by a batch of concurrent users and you see why a model with a 140 GB weight footprint can demand several hundred GB of GPU memory to actually serve.
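The formula drops straight into code. The Llama 3 70B dimensions below (80 layers, 8 KV heads under grouped-query attention, head dimension 128) are the model's published configuration, and FP16 is 2 bytes per value:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int = 2) -> int:
    """KV cache size: 2 (keys and values) x layers x KV heads x head
    dimension x precision bytes, per token, times the context length."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_len

# Llama 3 70B at a full 128K-token context, FP16 cache:
size = kv_cache_bytes(80, 8, 128, 128 * 1024)
print(f"{size / 2**30:.1f} GiB")  # 40.0 GiB, for ONE user's session
```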

Crucially, the KV cache grows linearly with context length. Attention compute grows quadratically during prefill (processing the initial prompt), because every token attends to every other token. During decode (generating new tokens), attention compute grows linearly with context length — because each new token only attends backward, and the cache spares us from recomputing old keys and values. The quadratic cost shows up as a wall of latency on the first token; the linear cost shows up as a wall of GPU memory for the entire session.

What Changes When You Scale Context

Four things get worse as context grows, and they get worse at different rates.

Time to first token (TTFT). The prefill pass processes the full prompt in one shot. Because attention is quadratic in sequence length during prefill, doubling the context does more than double the time before the model emits its first word. This is why a 200K-token prompt feels sluggish even on an H100 — the GPU is burning through trillions of attention operations before the model says anything at all.
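A back-of-envelope count of pairwise attention-score operations makes the scaling concrete. This counts only score computations during prefill, ignoring projections, the feed-forward network, and all constant factors:

```python
def prefill_attention_ops(context_len: int) -> int:
    # During prefill, every one of n tokens attends to every token up to
    # and including itself: n * (n + 1) / 2 score computations per
    # layer per head.
    return context_len * (context_len + 1) // 2

short, doubled = prefill_attention_ops(8_192), prefill_attention_ops(16_384)
ratio = doubled / short  # doubling the context roughly quadruples the work
```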

Memory pressure. KV cache grows linearly with context, but it grows per user. A production serving system running at any reasonable concurrency needs enough VRAM for weights plus KV cache across the entire batch. For a 70B model serving 32 concurrent users at 8K context, KV cache alone can consume 40–50 GB — often exceeding the weights themselves.
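The per-user arithmetic, assuming Llama-3-70B-like dimensions and an 8-bit quantized KV cache (an assumption for this sketch; an FP16 cache doubles the figure):

```python
# Per-token KV cache bytes: 2 (K and V) x 80 layers x 8 KV heads
# x head_dim 128 x 1 byte (8-bit quantized cache).
per_token = 2 * 80 * 8 * 128 * 1
users, context = 32, 8192

# The cache is per user, so batch memory scales with concurrency:
total = per_token * context * users
print(f"{total / 2**30:.1f} GiB of KV cache across the batch")
```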

Quality degradation. Models don’t use long contexts uniformly. The “lost in the middle” effect is real and well-benchmarked: models reliably recall information near the start and end of their context window, but hit rates on middle-positioned content drop measurably. Most commercial models advertising 200K tokens become noticeably less reliable well before that limit — typically around the 60–70% mark of advertised capacity.

Economics. Anthropic and Google both apply pricing surcharges above 200K tokens (typically 2x on input for the portion above the threshold), because the underlying serving cost genuinely doubles. If your product pushes long context by default, you’re paying that premium on every request.
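A sketch of the surcharge arithmetic. The 200K threshold mirrors the figure above, but the per-token rate is a placeholder, not any provider's actual price sheet:

```python
def input_cost(tokens: int, base_rate: float,
               threshold: int = 200_000, surcharge: float = 2.0) -> float:
    """Input cost with a long-context surcharge: tokens above
    `threshold` bill at `surcharge` times the base per-token rate.
    All rates here are placeholders, not real prices."""
    below = min(tokens, threshold)
    above = max(0, tokens - threshold)
    return below * base_rate + above * base_rate * surcharge

# Hypothetical $3 per million input tokens:
rate = 3.0 / 1_000_000
cheap = input_cost(150_000, rate)      # entirely under the threshold
pricey = input_cost(400_000, rate)     # 200K at 1x plus 200K at 2x
```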

There’s also a security dimension. Longer context means a larger attack surface. Anthropic’s own research has shown that extended context windows increase vulnerability to “many-shot” jailbreaking, where an adversary fills the context with adversarial examples to steer the model toward prohibited output. This is not theoretical. It scales with the length you offer.

Scenario One: Short Context (4K–32K tokens)

The overwhelming majority of real LLM traffic fits here. Customer-service bots, code completion, classification, translation, intent detection, short-form content generation, routine agentic sub-tasks. A developer’s IDE autocomplete rarely needs to see more than the current file and a few neighbors. A chatbot answering a product question doesn’t need a library.

Hardware. This is where inference can run on commodity hardware. A single RTX 4090 (24 GB) or RTX 5090 (32 GB) can serve a quantized 7B–13B model at 8K–32K context for a small number of concurrent users. A single H100 80GB handles a quantized 70B model at these lengths comfortably. Locally-hosted setups using Ollama or llama.cpp on consumer hardware hit their sweet spot at 8K–16K with 7B–14B models. For a .NET developer running Semantic Kernel or Microsoft.Extensions.AI against a local Ollama endpoint, this is the tier that actually works on a developer workstation without renting cloud GPUs.

The engineering advice here is blunt: if your workload fits in short context, don’t pay for long context. Use RAG to pull only the relevant chunks into a 16K window rather than shoving entire documents into a 1M window. The retrieval-plus-short-context pattern is almost always cheaper, faster, and more accurate than long-context-everywhere.
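The retrieve-then-prompt pattern, sketched with a deliberately crude keyword-overlap scorer. A real pipeline would use embedding similarity, but the token-budgeting logic is the same:

```python
def score(query: str, chunk: str) -> int:
    """Crude relevance: count of shared lowercase words. Stands in for
    embedding similarity in a real retrieval pipeline."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def build_prompt(query: str, chunks: list[str], token_budget: int) -> str:
    """Greedily pack the highest-scoring chunks into a small context
    window, using the rough 4-characters-per-token heuristic."""
    picked, used = [], 0
    for chunk in sorted(chunks, key=lambda c: score(query, c), reverse=True):
        cost = len(chunk) // 4
        if used + cost > token_budget:
            break
        picked.append(chunk)
        used += cost
    return "\n\n".join(picked + [query])
```

The point of the sketch is the budget, not the scorer: a 16K window filled with the right chunks beats a 1M window filled with everything.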

Scenario Two: Medium Context (64K–200K tokens)

This is the sweet spot for knowledge work. Full document QA (a 50-page PDF is roughly 100K tokens). Meeting transcript analysis. Code review across a handful of related files. Legal contract analysis. Research synthesis over a few papers. Multi-turn conversations that carry meaningful history. This is where Claude Sonnet 4.6’s standard 200K, GPT-5.4’s 272K, and Gemini 3 Flash’s 1M (used at medium length) are all competitive.

Hardware. This tier is firmly in data-center territory. A single H100 80GB can serve a 70B-class model at 128K context for a small batch. More realistic production deployments run on H100/H200 pairs, or on NVIDIA GH200 Grace Hopper Superchips — the latter couples 96 GB of HBM3 GPU memory with 480 GB of LPDDR CPU memory over a high-bandwidth NVLink-C2C interconnect, letting the KV cache spill into CPU RAM without the crushing PCIe penalty of a traditional host/device split. For self-hosted enterprise deployments, this is where infrastructure decisions start getting serious. You’re not running this on a workstation.

This is also the tier where inference optimizations stop being optional. PagedAttention (from vLLM) reduces KV cache fragmentation from 60–80% waste down to under 4%, roughly tripling serving throughput at these context lengths. FlashAttention makes the quadratic prefill computation IO-aware and dramatically faster. Grouped-Query Attention (GQA), now standard in Llama and Mistral architectures, shrinks the KV cache by sharing keys and values across groups of attention heads. None of this is exotic — it’s the baseline stack anyone running production inference at medium context is already using.
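The GQA savings are simple arithmetic. Using Llama 3 70B's published head counts (64 query heads sharing 8 KV heads):

```python
def kv_per_token(layers: int, kv_heads: int, head_dim: int,
                 byte_width: int = 2) -> int:
    """Per-token KV cache bytes; the cache scales with KV heads only."""
    return 2 * layers * kv_heads * head_dim * byte_width

mha = kv_per_token(80, 64, 128)  # hypothetical: full multi-head attention
gqa = kv_per_token(80, 8, 128)   # Llama 3 70B's grouped-query config
shrink = mha // gqa              # 8x smaller cache for the same model depth
```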

Scenario Three: Long Context (500K–10M+ tokens)

This is the frontier tier and the one most commonly oversold. Legitimate use cases exist: analyzing an entire enterprise codebase in one pass; multi-document synthesis across dozens of research papers; processing a full novel or technical book; genomics; log-analysis workloads where temporal context genuinely spans millions of tokens. Llama 4 Scout’s 10M window and Gemini 3 Pro’s 2M are the current high-water marks.

Hardware. At this tier you stop talking about a server and start talking about a rack. An NVL72-class system — 72 Blackwell GPUs interconnected with NVLink — is the class of infrastructure that makes million-token, multi-user inference viable at acceptable latency. For self-hosting, you’re looking at multi-node tensor-parallel and sequence-parallel configurations, typically 8x H100/H200 at minimum for a large model at long context, with KV cache offloading to CPU memory or NVMe as a matter of course. LMCache, PagedEviction, and layer-ahead CPU pre-computation frameworks like ScoutAttention are the tools being used in production to survive this regime.

There’s a harder question underneath the hardware math, and it’s worth saying plainly: the transformer’s quadratic attention is a structural cost that no optimization fully fixes. FlashAttention changes the constant factors. PagedAttention changes the fragmentation. Quantization to FP8 or NVFP4 halves the memory footprint. None of it changes the fundamental scaling. This is the argument for state-space models — Mamba-3 and its successors — and other sub-quadratic architectures. They trade some raw quality for context lengths that scale linearly in compute and memory. For anyone betting on 10M-token workflows as a steady state, the question worth asking is whether the transformer is the right substrate at all, or whether the industry’s going to discover in 2027 that we scaled the wrong architecture into the wall.

The Takeaway

The context window is not the model’s memory. It’s a working buffer whose cost is dominated by the KV cache, whose compute is dominated by quadratic attention at prefill, and whose quality degrades well before the advertised limit. Bigger context windows are genuinely useful for a narrow band of workloads — codebase analysis, long-document synthesis, deep multi-turn reasoning. For everything else, they’re an expensive luxury that a well-tuned RAG pipeline would handle better, faster, and cheaper.

When you’re picking a model, don’t pick by advertised window size. Pick the smallest context that actually covers your workload, and the cheapest hardware tier that serves it at your throughput and latency targets. Every token above that is money you’re burning to look impressive on a marketing slide.

