vibedonalds.com
Term

Quantization

Reducing the numerical precision of a model's weights (e.g. from 16-bit float to 4-bit integer) to shrink its memory footprint and speed up inference, at the cost of some accuracy. Common formats and methods: GGUF, AWQ, GPTQ, FP8.
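
To make the idea concrete, here is a minimal sketch of symmetric 4-bit quantization in plain Python. It is an illustration only, not the scheme used by GGUF, AWQ, or GPTQ: those typically keep separate scales per block or group of weights and add further refinements. The function names and the sample weights are made up for this example.

```python
def quantize_int4(weights):
    """Map floats to signed 4-bit integers in [-8, 7] with one shared scale.

    Assumption for this sketch: a single scale for the whole list
    (per-tensor), chosen so the largest magnitude maps to 7.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the 4-bit integers."""
    return [v * scale for v in q]


# Hypothetical weights, chosen only to show the round trip.
weights = [0.12, -0.53, 0.98, -1.40, 0.07, 0.66]
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)

# The gap between original and restored values is the "accuracy loss".
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is now stored as a 4-bit integer plus a share of one scale factor, and the rounding error per weight is at most half the scale. Fewer bits mean a coarser grid and a larger error, which is why very low bit widths degrade quality faster.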

Background

Quantization makes LLMs cheap enough to run locally. A 70B model in FP16 needs ~140 GB for its weights alone (2 bytes per parameter); as a Q4_K_M GGUF it fits in ~40 GB. The accuracy loss is small for 4-bit and 5-bit quantizations on most chat tasks; 2-bit is noticeably worse. llama.cpp and Ollama use GGUF by default. AWQ and GPTQ are alternatives for GPU-only inference. Quantization is the reason "run your own coding model" is viable on Apple Silicon and consumer NVIDIA cards in 2026.
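
The memory figures above follow from simple arithmetic: parameter count times bits per weight, divided by 8 to get bytes. The sketch below reproduces them. The 4.8 bits-per-weight value for Q4_K_M is an assumption for illustration (mixed-precision 4-bit formats store scales and keep some tensors at higher precision, so the effective width is above 4), and the estimate covers weights only, not the KV cache or activations.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Estimate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 70e9  # the 70B model from the text

fp16_gb = weight_memory_gb(N_PARAMS, 16)

# Assumed effective width for Q4_K_M; the real value varies by model.
q4_gb = weight_memory_gb(N_PARAMS, 4.8)

print(f"FP16:   {fp16_gb:.0f} GB")
print(f"Q4_K_M: {q4_gb:.0f} GB (assuming ~4.8 bits per weight)")
```

Under that assumption the quantized model comes out near 42 GB, in line with the ~40 GB figure, and roughly 3.3 times smaller than FP16.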