Neural network parameters are typically stored as 32-bit or 16-bit floating point numbers. Quantization reduces the precision of those numbers — storing them as 8-bit integers, or even 4-bit integers — dramatically cutting memory and compute requirements.
**The math:**
- FP32: 4 bytes per parameter
- FP16/BF16: 2 bytes per parameter
- INT8: 1 byte per parameter
- INT4: 0.5 bytes per parameter
A 70B model at FP16 = 140GB. At INT4 = 35GB. This is why Llama 3 70B can run on a Mac Studio with 64GB of unified memory.
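The arithmetic above is just parameters × bytes-per-parameter. A minimal sketch (the function name is illustrative; real deployments also need memory for the KV cache and runtime overhead):

```python
def model_memory_gb(n_params: float, bits: int) -> float:
    """Approximate weight-storage size in GB: params * bits / 8, ignoring KV cache and overhead."""
    return n_params * bits / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"70B at {bits}-bit: {model_memory_gb(70e9, bits):.0f} GB")
# 70B at 16-bit: 140 GB, at 4-bit: 35 GB
```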
**Why quality doesn't collapse:**
Most weights are approximately normally distributed around zero; extreme values are rare. A quantizer fits its small set of representable levels to that bulk, so the typical weight is reproduced accurately, and only the rare outliers lose meaningful precision. Modern schemes (per-channel scales, small quantization groups) limit even that damage, so for most tasks the outputs remain nearly indistinguishable from the full-precision model.
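You can see this with a toy experiment: quantize normally distributed "weights" to 4-bit integers with a single absmax scale and measure the reconstruction error. This is a simplified sketch, not any library's actual kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=100_000).astype(np.float32)  # weight-like values near zero

# Symmetric absmax quantization to signed INT4: levels -8..7, one scale for the whole tensor
scale = np.abs(w).max() / 7
q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
w_hat = q.astype(np.float32) * scale  # dequantize

rel_err = np.abs(w - w_hat).mean() / np.abs(w).mean()
print(f"mean relative error: {rel_err:.3f}")
```

Note that the single largest weight sets the scale for everything, which is exactly why outliers are the hard part; real formats shrink this error by giving each small group of weights its own scale.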
**Key quantization techniques:**
- **GPTQ**: Post-training quantization that minimizes per-layer reconstruction error using approximate second-order information. Runs offline, with no retraining.
- **AWQ**: Activation-aware Weight Quantization — uses activation statistics to find the most salient weight channels and protects them with per-channel scaling before quantizing. Often better quality than GPTQ at the same bit-width.
- **GGUF/llama.cpp**: GGUF is the file format llama.cpp uses for quantized models that run efficiently on CPUs and consumer GPUs. The reason LLMs run on laptops.
- **QLoRA**: Fine-tune a quantized model using low-rank adapters. Brings fine-tuning to consumer hardware.
**The frontier**: 2-bit and even ternary quantization (e.g., AQLM, QuIP#, BitNet b1.58) is emerging — with some quality loss, but enabling massive models on tiny hardware.
**Key takeaway:** Quantization shrinks model weights from 32-bit to 4-bit, cutting memory 8x with minimal quality loss — how 70B models run on laptops.