Neural network parameters are typically stored as 32-bit or 16-bit floating point numbers. Quantization reduces the precision of those numbers — storing them as 8-bit integers, or even 4-bit integers — dramatically cutting memory and compute requirements.
**The math:**
- FP32: 4 bytes per parameter
- FP16/BF16: 2 bytes per parameter
- INT8: 1 byte per parameter
- INT4: 0.5 bytes per parameter
A 70B model at FP16 = 140GB. At INT4 = 35GB. This is why Llama 3 70B can run on a Mac Studio with 64GB of unified memory.
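The arithmetic above is just parameters × bytes-per-parameter. A minimal sketch (the function name is illustrative; real deployments also need memory for the KV cache and runtime overhead):

```python
def model_memory_gb(n_params: float, bits: int) -> float:
    """Approximate weight-storage size in GB: params * bits / 8, ignoring KV cache and overhead."""
    return n_params * bits / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"70B at {bits}-bit: {model_memory_gb(70e9, bits):.0f} GB")
# 70B at 16-bit: 140 GB, at 4-bit: 35 GB
```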
**Why quality doesn't collapse:**
Most weights are approximately normally distributed around zero; extreme values are rare. A quantizer fits its small set of representable levels to that bulk, so the typical weight is reproduced accurately, and only the rare outliers lose meaningful precision. Modern schemes (per-channel scales, small quantization groups) limit even that damage, so for most tasks the outputs remain nearly indistinguishable from the full-precision model.
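You can see this with a toy experiment: quantize normally distributed "weights" to 4-bit integers with a single absmax scale and measure the reconstruction error. This is a simplified sketch, not any library's actual kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=100_000).astype(np.float32)  # weight-like values near zero

# Symmetric absmax quantization to signed INT4: levels -8..7, one scale for the whole tensor
scale = np.abs(w).max() / 7
q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
w_hat = q.astype(np.float32) * scale  # dequantize

rel_err = np.abs(w - w_hat).mean() / np.abs(w).mean()
print(f"mean relative error: {rel_err:.3f}")
```

Note that the single largest weight sets the scale for everything, which is exactly why outliers are the hard part; real formats shrink this error by giving each small group of weights its own scale.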
**Key quantization techniques:**
- **GPTQ**: Post-training quantization that minimizes per-layer reconstruction error using approximate second-order information. Runs offline, with no retraining.
- **AWQ**: Activation-aware Weight Quantization — uses activation statistics to find the most salient weight channels and protects them with per-channel scaling before quantizing. Often better quality than GPTQ at the same bit-width.
- **GGUF/llama.cpp**: GGUF is the file format llama.cpp uses for quantized models that run efficiently on CPUs and consumer GPUs. The reason LLMs run on laptops.
- **QLoRA**: Fine-tune a quantized model using low-rank adapters. Brings fine-tuning to consumer hardware.
**The frontier**: 2-bit and even ternary quantization (e.g., AQLM, QuIP#, BitNet b1.58) is emerging — with some quality loss, but enabling massive models on tiny hardware.
**Key takeaway:** Quantization shrinks model weights from 32-bit to 4-bit, cutting memory 8x with minimal quality loss — how 70B models run on laptops.