Why GPU Memory is the Real Bottleneck in AI Infrastructure

The conversation around AI infrastructure focuses on FLOPS and GPU count, but in practice memory determines which models you can run. A 70B parameter model needs at least 140GB of GPU memory in FP16, far exceeding the 80GB on a single H100, and this constraint shapes nearly every infrastructure decision.

GPU compute power gets the headlines, but memory is the constraint that quietly dictates AI infrastructure choices. A model's parameters must fit in GPU memory during inference and training, along with additional memory for activations, gradients, optimizer states, and KV caches. An NVIDIA H100 has 80GB of HBM3 memory; a B200 has 192GB. A 70 billion parameter model in 16-bit precision needs 140GB just for weights, plus more for activations and KV cache during inference. Even a 70B model cannot fit on a single H100: it requires multi-GPU sharding.
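As a quick sanity check, the weight footprint is simply parameter count times bytes per parameter. A minimal sketch (the function name is ours, for illustration):

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the model weights, in GB."""
    return n_params * bytes_per_param / 1e9

# 70B parameters at 2 bytes each (FP16/BF16): 140 GB of weights alone,
# already well past the 80 GB of HBM3 on a single H100.
print(weight_memory_gb(70e9, 2.0))  # 140.0
```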
For training, the memory math is worse. Gradients double the requirement, Adam's optimizer states (an FP32 master copy of the weights plus momentum and variance estimates) add roughly three times that footprint again, and activation memory scales with batch size and sequence length. In the common mixed-precision layout this works out to about 16 bytes per parameter, so training a 70B model in FP16 with Adam needs roughly 1.1TB of memory for weights, gradients, and optimizer states alone.
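A back-of-the-envelope sketch of that arithmetic, assuming the common mixed-precision layout (FP16 weights and gradients, FP32 master weights, momentum, and variance for Adam):

```python
def adam_training_memory_gb(n_params: float) -> float:
    """Static training memory, excluding activations:
    FP16 weights (2B) + FP16 gradients (2B)
    + FP32 master weights, momentum, variance (4B each = 12B)."""
    bytes_per_param = 2 + 2 + 12  # 16 bytes per parameter total
    return n_params * bytes_per_param / 1e9

print(adam_training_memory_gb(70e9))  # 1120.0 GB, i.e. roughly 1.1TB
```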
Several techniques compensate. Quantization reduces memory directly: int8 halves the footprint versus FP16, and int4 quarters it, usually at some cost in accuracy.
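The effect on the weight footprint is straight bit arithmetic:

```python
N_PARAMS = 70e9  # a 70B-parameter model

for name, bits in [("FP16", 16), ("int8", 8), ("int4", 4)]:
    gb = N_PARAMS * bits / 8 / 1e9  # bits -> bytes -> GB
    print(f"{name}: {gb:.0f} GB")
# FP16: 140 GB, int8: 70 GB, int4: 35 GB
```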
Gradient checkpointing trades compute for memory by recomputing activations during the backward pass instead of storing them through the forward pass.
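In PyTorch this is a per-block change via torch.utils.checkpoint. A minimal sketch with a toy residual block (the modules here are illustrative, not any particular model):

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """Toy residual feed-forward block, for illustration only."""
    def __init__(self, dim: int):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        return x + self.ff(x)

class Net(torch.nn.Module):
    def __init__(self, dim: int = 512, depth: int = 8):
        super().__init__()
        self.blocks = torch.nn.ModuleList(Block(dim) for _ in range(depth))

    def forward(self, x):
        for blk in self.blocks:
            # Activations inside blk are recomputed during the backward
            # pass instead of being stored, trading compute for memory.
            x = checkpoint(blk, x, use_reentrant=False)
        return x

x = torch.randn(4, 512, requires_grad=True)
Net()(x).sum().backward()  # backward recomputes each block's forward
```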
ZeRO and FSDP shard optimizer states, gradients, and parameters across GPUs, so larger models fit on smaller clusters at the cost of extra communication.
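A minimal FSDP sketch, assuming torch.distributed is already initialized (e.g. launched with torchrun) and reusing the toy Net above; production setups also configure auto-wrap policies, mixed precision, and offloading:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = Net().cuda()   # the toy model from the previous sketch
sharded = FSDP(model)  # parameters, gradients, and optimizer state
                       # are sharded across the process group
optimizer = torch.optim.AdamW(sharded.parameters(), lr=1e-4)
```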
During inference, each generated token's key and value tensors are cached per layer, so the KV cache grows linearly with batch size and sequence length and can consume a substantial share of whatever memory the weights leave free. KV cache management, paged attention (vLLM), and continuous batching squeeze more inference throughput from that limited memory.
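The cache size follows directly from the attention layout. A sketch with illustrative, assumed 70B-class numbers (80 layers, 8 KV heads under grouped-query attention, head dimension 128), not the configuration of any specific model:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_value: int = 2) -> float:
    """KV cache size: 2 tensors (K and V) per layer, per token, in GB."""
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch * bytes_per_value) / 1e9

# Illustrative 70B-class configuration (assumed):
print(kv_cache_gb(80, 8, 128, seq_len=8192, batch=16))  # ~42.9 GB
```

Paged attention attacks exactly this cost: instead of reserving a contiguous maximum-length cache per request, it allocates fixed-size blocks on demand, so over-reservation and fragmentation stop wasting memory.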
Specialized chips like Cerebras and Groq take radically different approaches to the memory problem, sometimes with much higher on-chip memory or unusual architectures. For infrastructure architects, understanding memory constraints first, before raw FLOPS, is what separates productive AI infrastructure from expensive bottlenecks.