Quantization

PMetal provides GGUF export across modern K-quants, legacy GGML formats, BF16/F16/F32 dense formats, and newer Q1/TQ/MXFP4/NVFP4 formats. It also supports MLX-format quantized exports with target bits-per-weight allocation.

| Format | Description | Typical Size Reduction |
| --- | --- | --- |
| f32 | Float32 (no quantization) | |
| f16 | Float16 | |
| bf16 | BFloat16 | |
| q8_0 | 8-bit quantization | |
| q8_1 | 8-bit with dot-product sum helper | ~4× |
| q6_k | 6-bit K-quant | ~5× |
| q5_k_m, q5_k_s | 5-bit K-quant medium/small | ~6× |
| q5_0, q5_1 | Legacy 5-bit GGML variants | ~6× |
| q4_k_m, q4_k_s | 4-bit K-quant medium/small | ~8× |
| q4_0, q4_1 | Legacy 4-bit GGML variants | ~8× |
| q3_k_m, q3_k_s, q3_k_l | 3-bit K-quant variants | ~10× |
| q2_k | 2-bit K-quant | ~16× |
| q1_0 | 1-bit sign quantization | very high |
| tq1_0, tq2_0 | Ternary GGML quantization | very high |
| mxfp4, nvfp4 | 4-bit floating-point block formats | ~8× |
| dynamic | Importance-matrix-guided mixed precision | varies |
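
To get a feel for what these reduction factors mean on disk, the following rough sketch converts a parameter count into an approximate file size. It assumes the factors in the table are relative to f32 (4 bytes per weight), which the table does not state explicitly, and it ignores metadata and any tensors kept at higher precision. `estimate_size_gb` is an illustrative helper, not part of PMetal.

```python
# Approximate reduction factors copied from the table above.
# Assumption: factors are relative to f32 (4 bytes per weight).
REDUCTION = {
    "q8_1": 4.0,
    "q6_k": 5.0,
    "q5_k_m": 6.0,
    "q4_k_m": 8.0,
    "q3_k_m": 10.0,
    "q2_k": 16.0,
}


def estimate_size_gb(num_params: float, method: str) -> float:
    """Rough on-disk size in GB (10^9 bytes), ignoring metadata overhead."""
    f32_bytes = num_params * 4
    return f32_bytes / REDUCTION[method] / 1e9


if __name__ == "__main__":
    for method in REDUCTION:
        size = estimate_size_gb(4e9, method)
        print(f"{method:8s} ~{size:.1f} GB for a 4B-parameter model")
```

Real files will usually come out somewhat larger than this estimate, since some tensors are typically stored at higher precision than the headline format.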

Use --imatrix with a calibration dataset to preserve quality on important weights:

pmetal quantize \
--model ./output \
--output model.gguf \
--method q4_k_m \
--imatrix calibration.jsonl
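
Conceptually, an importance matrix tells the quantizer which weights matter most on the calibration data, so rounding error on those weights is penalized more heavily. The sketch below illustrates that idea on a single block with a simple scale search. It is not PMetal's or GGUF's actual algorithm, and `quantize_block`, the candidate grid, and the sample values are all made up for illustration.

```python
def quantize_block(weights, importance, bits=4, candidates=64):
    """Pick the symmetric scale that minimizes importance-weighted squared error."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4-bit symmetric quantization
    max_abs = max(abs(w) for w in weights) or 1.0
    best = None
    for i in range(1, candidates + 1):
        # Try scales from about half the naive max-abs scale up to the full one.
        scale = (max_abs / qmax) * (0.5 + 0.5 * i / candidates)
        q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
        err = sum(
            imp * (w - qi * scale) ** 2
            for w, qi, imp in zip(weights, q, importance)
        )
        if best is None or err < best[0]:
            best = (err, scale, q)
    return best  # (weighted_error, scale, quantized_ints)


weights = [0.02, -0.01, 0.03, 0.9, -0.04, 0.01, 0.05, -0.02]
uniform = [1.0] * len(weights)
# Hypothetical importance: the small weights matter far more than the outlier.
skewed = [10.0, 10.0, 10.0, 0.1, 10.0, 10.0, 10.0, 10.0]

err_u, scale_u, _ = quantize_block(weights, uniform)
err_s, scale_s, _ = quantize_block(weights, skewed)
print(f"uniform importance: scale={scale_u:.4f}")
print(f"skewed importance:  scale={scale_s:.4f}")
```

The search never does worse than the naive max-abs scale, because that scale is always one of the candidates. The importance weights only change which candidate wins.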

Use KL calibration to choose per-tensor quantization types under a quality threshold or target bits-per-weight budget:

pmetal quantize \
--model ./output \
--output model.gguf \
--kl-calibrate \
--target-bpw 4.5
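
One way to picture a per-tensor search under a bits-per-weight budget is a greedy loop: start every tensor at the highest-precision candidate, then repeatedly downgrade whichever tensor adds the least KL divergence per bit saved, until the average bits-per-weight meets the target. This is only a sketch of the general idea. PMetal's actual selection procedure is not described here, and the tensor names, candidate bit widths, and KL numbers below are invented.

```python
def allocate(tensors, kl, target_bpw):
    """Greedy per-tensor bit allocation.

    tensors: {name: num_weights}
    kl:      {name: {bits: measured_kl_divergence}}
    Returns {name: bits}; average bpw is <= target_bpw when that is reachable.
    """
    choice = {name: max(kl[name]) for name in tensors}  # highest precision first
    total = sum(tensors.values())

    def avg_bpw():
        return sum(tensors[n] * choice[n] for n in tensors) / total

    while avg_bpw() > target_bpw:
        best = None
        for name in tensors:
            lower = [b for b in kl[name] if b < choice[name]]
            if not lower:
                continue
            nxt = max(lower)  # next step down for this tensor
            added_kl = kl[name][nxt] - kl[name][choice[name]]
            bits_saved = tensors[name] * (choice[name] - nxt)
            cost = added_kl / bits_saved
            if best is None or cost < best[0]:
                best = (cost, name, nxt)
        if best is None:
            break  # nothing left to downgrade
        choice[best[1]] = best[2]
    return choice


# Hypothetical tensors and KL measurements, for illustration only.
tensors = {"attn.q_proj": 1_000_000, "mlp.up_proj": 3_000_000, "lm_head": 500_000}
kl = {
    "attn.q_proj": {8: 0.000, 6: 0.001, 4: 0.004},
    "mlp.up_proj": {8: 0.000, 6: 0.0005, 4: 0.002},
    "lm_head": {8: 0.000, 6: 0.004, 4: 0.020},
}
plan = allocate(tensors, kl, target_bpw=4.5)
print(plan)
```

With these made-up numbers, the sensitive `lm_head` tensor keeps its 8-bit setting while the larger, less sensitive tensors absorb the reduction.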

For MLX-format exports, use --format mlx with --bits and --group-size:

pmetal quantize \
--model ./output \
--output ./mlx-quantized \
--format mlx \
--bits 4 \
--group-size 64
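
Together, --bits and --group-size describe group-wise quantization: each run of `group_size` consecutive weights shares its own scale and offset, and each weight is stored as a `bits`-wide integer. The sketch below shows that scheme in plain Python rather than the MLX kernels themselves. The effective bits-per-weight figure assumes each group stores a 16-bit scale and a 16-bit bias, which is an assumption about the storage layout rather than something stated above.

```python
def quantize_groups(weights, bits=4, group_size=64):
    """Affine-quantize a flat list of floats in groups.

    Returns (ints, scales, biases), with one scale and bias per group.
    """
    levels = 2 ** bits - 1
    q, scales, biases = [], [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels or 1.0  # avoid a zero scale for flat groups
        q.extend(min(levels, max(0, round((w - lo) / scale))) for w in group)
        scales.append(scale)
        biases.append(lo)
    return q, scales, biases


def dequantize_groups(q, scales, biases, group_size=64):
    """Reconstruct approximate floats from the quantized representation."""
    return [
        q[i] * scales[i // group_size] + biases[i // group_size]
        for i in range(len(q))
    ]


def effective_bpw(bits=4, group_size=64, overhead_bits=32):
    """Bits per weight including per-group scale + bias (assumed 16 bits each)."""
    return bits + overhead_bits / group_size


print(effective_bpw(bits=4, group_size=64))
```

Smaller groups track local weight ranges more closely but raise the per-group overhead, which is the trade-off that --group-size controls.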

For inference-time memory reduction without GGUF conversion:

pmetal infer --model Qwen/Qwen3-4B --fp8 --chat

The --fp8 flag converts weights to FP8 (E4M3) at load time, for an approximately 2× reduction in weight memory.
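
E4M3 packs each value into one byte: 1 sign bit, 4 exponent bits, and 3 mantissa bits, which accounts for a roughly 2× saving if the original weights are 16-bit. As background on what that byte can represent, here is a small decoder for the widely used OCP-style E4M3 encoding (exponent bias 7, no infinities). Whether PMetal uses exactly this variant, or applies additional per-tensor scaling, is not stated above, so treat this as background rather than a description of its loader.

```python
import math


def decode_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3 byte (bias 7, no infinities, 0x7F/0xFF are NaN)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0x0F
    mant = byte & 0x07
    if exp == 0x0F and mant == 0x07:
        return math.nan
    if exp == 0:
        # Subnormal: no implicit leading 1, fixed exponent of 2^-6.
        return sign * (mant / 8) * 2.0 ** -6
    return sign * (1 + mant / 8) * 2.0 ** (exp - 7)


values = [decode_e4m3(b) for b in range(256)]
finite = [v for v in values if not math.isnan(v)]
print("largest finite value:", max(finite))
print("smallest positive value:", min(v for v in finite if v > 0))
```

The narrow range (largest finite magnitude 448) is why FP8 weight formats are commonly paired with a scale factor.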

Weight quantization is separate from KV cache compression. For long-context inference and serving, PMetal also supports TurboQuant KV cache compression:

pmetal infer --model Qwen/Qwen3-0.6B --kv-turboquant-preset q3_5 --chat
pmetal serve --model Qwen/Qwen3-0.6B --kv-turboquant-preset q3_5 --continuous-batch

Use q3_5 for near-lossless compression, or q2_5 when reducing memory pressure matters more than fidelity.
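
KV cache size grows linearly with context length, which is why compressing it matters for long contexts. The estimate below uses the standard formula (2 tensors × layers × KV heads × head dimension × tokens × bits per element). The model dimensions are placeholders rather than a real Qwen3 configuration, and reading `q3_5` and `q2_5` as roughly 3.5 and 2.5 bits per element is an assumption based on the preset names alone.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bits_per_element):
    """Approximate KV cache size: keys and values for every layer and token."""
    elements = 2 * layers * kv_heads * head_dim * tokens
    return elements * bits_per_element / 8


# Placeholder model shape (hypothetical, not taken from any specific model).
shape = dict(layers=28, kv_heads=8, head_dim=128, tokens=32_768)

for label, bits in [
    ("16-bit baseline", 16),
    ("q3_5 (assumed 3.5 bits)", 3.5),
    ("q2_5 (assumed 2.5 bits)", 2.5),
]:
    mib = kv_cache_bytes(**shape, bits_per_element=bits) / 2**20
    print(f"{label:26s} ~{mib:,.0f} MiB")
```

Under continuous batching, this per-sequence figure is multiplied by the number of concurrent sequences, so the savings compound when serving.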