Term

Quantization formats

Schemes for storing LLM weights at lower precision than their native FP16 or BF16. GGUF (llama.cpp), AWQ, GPTQ, FP8, and EXL2 are the common ones in 2026. Each strikes a different balance between quality, size, and inference speed.

Background

Quantization formats are the encoding schemes used to store LLM weights at lower numeric precision than their native FP16 or BF16, shrinking memory footprint and often speeding inference at some cost to accuracy. Instead of sixteen bits per parameter, quantized weights are packed into eight, four, or fewer bits, typically by grouping weights and storing a scale and zero-point per group so values can be reconstructed at runtime. The formats differ in how they choose those factors, how they group weights, and which runtimes consume them.

GGUF is the file format used by llama.cpp and its ecosystem. It packages weights together with metadata and supports many integer quantization levels for CPU and mixed CPU/GPU inference. GPTQ and AWQ are post-training methods that produce GPU-friendly weights, most often at four bits: GPTQ minimizes reconstruction error layer by layer using calibration data, while AWQ preserves the weights most important to activations to protect quality at low bit-widths. EXL2 is the ExLlamaV2 format, notable for mixed-precision quantization in which different layers get different bit-widths. FP8 is an eight-bit floating-point format, increasingly supported in hardware, that keeps a floating-point representation rather than mapping weights to integers.

These formats matter because they decide whether a given model fits on your GPU or laptop at all. A large model that needs many gigabytes at FP16 can run on consumer hardware once quantized, and the format you pick constrains which inference engine you can use and how much quality you trade for that fit.
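The grouping idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm of any named format: it assumes plain min/max scaling per group, stores the group minimum as the offset (playing the role of a zero-point), and assumes the scale and offset each take 16 bits. The function names, the group size of 32, and the 7-billion-parameter figure are all hypothetical choices made for the example.

```python
def quantize_group(weights, bits=4):
    """Map one group of floats to integer codes in [0, 2**bits - 1]."""
    qmax = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    # A constant group has no range; any nonzero scale works in that case.
    scale = (hi - lo) / qmax if hi > lo else 1.0
    codes = [min(qmax, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, offset):
    """Reconstruct approximate floats from the codes and per-group factors."""
    return [q * scale + offset for q in codes]


def quantize(weights, group_size=32, bits=4):
    """Split a flat weight list into groups and quantize each one."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


def dequantize(groups):
    """Reverse quantize(): concatenate the reconstructed groups."""
    restored = []
    for codes, scale, offset in groups:
        restored.extend(dequantize_group(codes, scale, offset))
    return restored


def bits_per_weight(bits=4, group_size=32, meta_bits=32):
    """Effective storage cost, counting the per-group scale and offset.

    meta_bits=32 assumes a 16-bit scale plus a 16-bit offset per group.
    """
    return bits + meta_bits / group_size


def weight_bytes(n_params, bpw):
    """Bytes needed for the weights alone at a given bits-per-weight."""
    return n_params * bpw / 8


if __name__ == "__main__":
    weights = [0.12, -0.40, 0.33, 0.05, -0.27, 0.48, -0.09, 0.21]
    restored = dequantize(quantize(weights, group_size=4))
    for original, approx in zip(weights, restored):
        print(f"{original:+.3f} -> {approx:+.3f}")

    n_params = 7_000_000_000  # hypothetical model size
    print(weight_bytes(n_params, 16) / 1e9, "GB at 16 bits")
    print(weight_bytes(n_params, bits_per_weight()) / 1e9, "GB at 4 bits + metadata")
```

Under these assumptions, four-bit codes with a group size of 32 cost about five bits per weight once the per-group factors are counted, so the hypothetical 7-billion-parameter model drops from 14 GB of weights to roughly 4.4 GB. Real formats choose the factors more carefully than min/max scaling, which is where much of their quality difference comes from.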

Tools that use it