Term
Quantization formats
Schemes for storing LLM weights at lower precision than the usual 16-bit (FP16/BF16) checkpoints. GGUF (llama.cpp), AWQ, GPTQ, FP8, and EXL2 are the common ones in 2026. Each makes a different trade-off between output quality, model size, and inference speed.
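The size side of that trade-off is simple arithmetic: weight storage is roughly parameter count times bits per weight. A minimal sketch follows, assuming a hypothetical 7-billion-parameter model and round illustrative bit widths. The helper name weight_size_gib is made up for this example, and real files in any of these formats also carry scales, metadata, and sometimes unquantized layers, so actual sizes will differ.

```python
def weight_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in GiB.

    Ignores per-group scales, zero points, and file metadata, which
    every real quantization format adds on top.
    """
    return n_params * bits_per_weight / 8 / 2**30


# Hypothetical model size and illustrative bit widths (assumptions,
# not the specification of any particular format).
N_PARAMS = 7e9
for label, bits in [("16-bit baseline", 16), ("8-bit", 8), ("4-bit", 4), ("3-bit", 3)]:
    print(f"{label:>16}: {weight_size_gib(N_PARAMS, bits):6.2f} GiB")
```

Halving the bits per weight halves the estimate, which is why a 4-bit quantization of a model fits in roughly a quarter of the memory its 16-bit checkpoint needs.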
Background
GGUF dominates CPU and Apple Silicon inference and is the format Ollama and LM Studio use. AWQ and GPTQ target NVIDIA GPUs and are typically served with TensorRT-LLM or vLLM. FP8 is the newest of these; H100, H200, and B200 hardware supports it natively, and quality loss is minimal. EXL2 is favored in the ExLlamaV2 community for very low-bit (2-3 bit) quantizations.
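To give a feel for why quality loss grows as bit width shrinks, here is a rough sketch that quantizes one group of synthetic weights with plain symmetric round-to-nearest and measures the round-trip error. This is an assumption-laden illustration, not the algorithm used by GGUF, AWQ, GPTQ, FP8, or EXL2, all of which use more careful schemes. The function names and the Gaussian test data are hypothetical.

```python
import random


def quantize_roundtrip(values, bits):
    """Symmetric uniform quantization to `bits` bits, then dequantization."""
    qmax = 2 ** (bits - 1) - 1  # largest signed integer level, e.g. 7 for 4-bit
    scale = max(abs(v) for v in values) / qmax
    if scale == 0:
        return list(values)  # all-zero group: nothing to quantize
    # Round each value to the nearest level, clamp, and map back to floats.
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def rms_error(values, bits):
    """Root-mean-square difference between original and round-tripped values."""
    restored = quantize_roundtrip(values, bits)
    return (sum((a - b) ** 2 for a, b in zip(values, restored)) / len(values)) ** 0.5


random.seed(0)
# One hypothetical group of 128 weights drawn from a narrow Gaussian.
group = [random.gauss(0.0, 0.02) for _ in range(128)]
for bits in (8, 4, 3, 2):
    print(f"{bits}-bit RMS error: {rms_error(group, bits):.6f}")
```

Each bit removed roughly doubles the spacing between representable values, so the error climbs quickly below 4 bits. The formats that target 2-3 bits compensate with techniques beyond plain rounding.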