
Quantization

PMetal provides GGUF quantization with 13 format options and importance matrix support for quality-preserving compression.

| Format | Description | Typical Size Reduction |
| --- | --- | --- |
| f32 | Float32 (no quantization) | 1× |
| f16 | Float16 | ~2× |
| q8_0 | 8-bit quantization | ~4× |
| q6k | 6-bit k-quant | ~5× |
| q5km | 5-bit k-quant (medium) | ~6× |
| q5ks | 5-bit k-quant (small) | ~6× |
| q4km | 4-bit k-quant (medium) | ~8× |
| q4ks | 4-bit k-quant (small) | ~8× |
| q3km | 3-bit k-quant (medium) | ~10× |
| q3ks | 3-bit k-quant (small) | ~10× |
| q3kl | 3-bit k-quant (large) | ~10× |
| q2k | 2-bit k-quant | ~16× |
| dynamic | Auto-select per layer | varies |

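The reduction factors are relative to f32 and approximate, since the k-quant formats also store per-block scales. As a rough sanity check on the arithmetic (a back-of-the-envelope sketch, not PMetal output; the bit widths are nominal):

```python
# Approximate model size at nominal bits per weight
# (ignores k-quant block/scale overhead, so real GGUF files are slightly larger).
params = 4e9  # e.g. a 4B-parameter model
for fmt, bits in [("f32", 32), ("f16", 16), ("q8_0", 8), ("q6k", 6), ("q4km", 4), ("q2k", 2)]:
    gb = params * bits / 8 / 1e9
    print(f"{fmt:>5}: ~{gb:4.1f} GB  (~{32 / bits:.0f}x smaller than f32)")
```
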
Use --imatrix with a calibration dataset to preserve quality on important weights:

pmetal quantize \
--model ./output \
--output model.gguf \
--type q4km \
--imatrix calibration.jsonl
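
Conceptually, the importance matrix records activation statistics for each weight channel from the calibration set, and the quantizer then minimizes error weighted by those statistics rather than plain squared error, so channels that influence the model's outputs most are rounded most carefully. A minimal illustration of the idea (not PMetal's actual algorithm; the function and data below are made up for this sketch):

```python
import numpy as np

def quantize_row(w, importance, bits=4):
    """Pick a scale that minimizes importance-weighted squared error (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1
    best_scale, best_err = None, np.inf
    # Search candidate scales around the naive max-abs choice.
    for f in np.linspace(0.8, 1.2, 41):
        scale = f * np.abs(w).max() / qmax
        q = np.clip(np.round(w / scale), -qmax - 1, qmax)
        err = np.sum(importance * (w - q * scale) ** 2)
        if err < best_err:
            best_scale, best_err = scale, err
    return np.clip(np.round(w / best_scale), -qmax - 1, qmax), best_scale

w = np.random.randn(256).astype(np.float32)
imp = np.abs(np.random.randn(256))  # stand-in for activation statistics from calibration data
q, scale = quantize_row(w, imp)
```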

For inference-time memory reduction without GGUF conversion:

pmetal infer --model Qwen/Qwen3-4B --fp8 --chat

Weights are converted to FP8 (E4M3) at load time, reducing memory use by approximately 2×.
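
The 2× figure follows from halving per-weight storage relative to 16-bit weights (1 byte vs 2 bytes). A quick check of that arithmetic using the ml_dtypes package's E4M3 dtype (an assumption for illustration only; PMetal does not require it):

```python
# Rough sketch (not PMetal internals): casting 16-bit weights to FP8 E4M3
# halves per-weight storage from 2 bytes to 1 byte.
import numpy as np
import ml_dtypes  # pip install ml-dtypes; adds float8_e4m3fn as a NumPy dtype

w16 = np.random.randn(4096, 4096).astype(np.float16)
w8 = w16.astype(ml_dtypes.float8_e4m3fn)  # E4M3: 4 exponent bits, 3 mantissa bits, max ~448

print(f"fp16: {w16.nbytes / 1e6:.1f} MB, fp8: {w8.nbytes / 1e6:.1f} MB")  # ~33.6 MB vs ~16.8 MB
```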