Quantization
PMetal provides GGUF export across modern K-quants, legacy GGML formats, BF16/F16/F32 dense formats, and newer Q1/TQ/MXFP4/NVFP4 formats. It also supports MLX-format quantized exports with target bits-per-weight allocation.
Quantization Formats
| Format | Description | Typical Size Reduction |
|---|---|---|
| f32 | Float32 (no quantization) | 1× |
| f16 | Float16 | 2× |
| bf16 | BFloat16 | 2× |
| q8_0 | 8-bit quantization | 4× |
| q8_1 | 8-bit with dot-product sum helper | ~4× |
| q6_k | 6-bit K-quant | ~5× |
| q5_k_m, q5_k_s | 5-bit K-quant medium/small | ~6× |
| q5_0, q5_1 | Legacy 5-bit GGML variants | ~6× |
| q4_k_m, q4_k_s | 4-bit K-quant medium/small | ~8× |
| q4_0, q4_1 | Legacy 4-bit GGML variants | ~8× |
| q3_k_m, q3_k_s, q3_k_l | 3-bit K-quant variants | ~10× |
| q2_k | 2-bit K-quant | ~16× |
| q1_0 | 1-bit sign quantization | very high |
| tq1_0, tq2_0 | Ternary GGML quantization | very high |
| mxfp4, nvfp4 | 4-bit floating-point block formats | ~8× |
| dynamic | Importance-matrix-guided mixed precision | varies |
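As a rough sanity check, the reduction factors in the table can be turned into an estimated file size. The sketch below is illustrative only: it assumes the factors are relative to an F32 baseline of 4 bytes per weight (consistent with f32 being listed as 1×), and it ignores file metadata and any tensors kept at higher precision. `REDUCTION` and `estimate_size_bytes` are hypothetical names, not part of PMetal.

```python
# Approximate reduction factors copied from the table above (relative to f32).
REDUCTION = {
    "f32": 1.0, "f16": 2.0, "bf16": 2.0, "q8_0": 4.0, "q6_k": 5.0,
    "q5_k_m": 6.0, "q4_k_m": 8.0, "q3_k_m": 10.0, "q2_k": 16.0,
}

BYTES_PER_F32_WEIGHT = 4  # assumption: the table's baseline is 32-bit floats


def estimate_size_bytes(num_params: int, fmt: str) -> float:
    """Estimate weight storage for `num_params` weights stored in format `fmt`."""
    return num_params * BYTES_PER_F32_WEIGHT / REDUCTION[fmt]


if __name__ == "__main__":
    params = 4_000_000_000  # illustrative parameter count, not a specific model
    for fmt in ("f16", "q8_0", "q4_k_m", "q2_k"):
        gb = estimate_size_bytes(params, fmt) / 1e9
        print(f"{fmt:>7}: ~{gb:.1f} GB")
```

Because the table's factors are approximate, the results are order-of-magnitude estimates rather than exact output sizes.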
Importance Matrix
Use `--imatrix` with a calibration dataset to preserve quality on important weights:

```bash
pmetal quantize \
  --model ./output \
  --output model.gguf \
  --method q4_k_m \
  --imatrix calibration.jsonl
```

KL Calibration and MLX Export
Use KL calibration to choose per-tensor quantization types under a quality threshold or a target bits-per-weight budget:

```bash
pmetal quantize \
  --model ./output \
  --output model.gguf \
  --kl-calibrate \
  --target-bpw 4.5
```

For MLX-format exports, use `--format mlx` with `--bits` and `--group-size`:

```bash
pmetal quantize \
  --model ./output \
  --output ./mlx-quantized \
  --format mlx \
  --bits 4 \
  --group-size 64
```

FP8 Runtime Quantization
For inference-time memory reduction without GGUF conversion:

```bash
pmetal infer --model Qwen/Qwen3-4B --fp8 --chat
```

The `--fp8` flag converts weights to FP8 (E4M3) at load time, for approximately 2× memory reduction.
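To make the roughly 2× figure concrete: an E4M3 value occupies one byte (1 sign bit, 4 exponent bits, 3 mantissa bits), half the size of a 16-bit weight. The sketch below decodes such a byte, assuming the common OCP FP8 E4M3 encoding (exponent bias 7, no infinities, NaN when exponent and mantissa bits are all ones). Whether PMetal uses exactly this variant is an assumption, and `decode_e4m3` is a hypothetical helper, not a PMetal API.

```python
import math


def decode_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3 byte (assumed OCP 'FN' variant) to a Python float."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if exponent == 0x0F and mantissa == 0x07:
        return math.nan  # the only NaN pattern; this variant has no infinities
    if exponent == 0:
        # Subnormal: no implicit leading 1, fixed scale of 2**-6.
        return sign * (mantissa / 8) * 2.0 ** -6
    # Normal: implicit leading 1 and an exponent bias of 7.
    return sign * (1 + mantissa / 8) * 2.0 ** (exponent - 7)


if __name__ == "__main__":
    decoded = [decode_e4m3(b) for b in range(256)]
    finite = [v for v in decoded if not math.isnan(v)]
    print("largest finite value:", max(finite))
    print("smallest positive value:", min(v for v in finite if v > 0))
    print("bytes per weight: FP8 = 1, BF16/F16 = 2")
```

The narrow range and 3-bit mantissa are why FP8 trades some precision for the halved footprint.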
TurboQuant KV Cache
Weight quantization is separate from KV cache compression. For long-context inference and serving, PMetal also supports TurboQuant KV cache compression:

```bash
pmetal infer --model Qwen/Qwen3-0.6B --kv-turboquant-preset q3_5 --chat
pmetal serve --model Qwen/Qwen3-0.6B --kv-turboquant-preset q3_5 --continuous-batch
```

Use `q3_5` for near-lossless compression, or `q2_5` when memory pressure matters more than quality.
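To see why KV cache compression matters for long contexts, the cache size can be estimated as 2 (keys and values) × layers × KV heads × head dimension × tokens × bits per value. The sketch below is a back-of-the-envelope calculation only: the model dimensions are illustrative rather than read from a specific checkpoint, the reading of q3_5 and q2_5 as roughly 3.5 and 2.5 bits per value is an assumption based on the preset names, and any per-block metadata overhead is ignored. `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bits_per_value: float) -> float:
    """Estimate KV cache size for one sequence, ignoring metadata overhead."""
    values = 2 * layers * kv_heads * head_dim * tokens  # 2 = keys + values
    return values * bits_per_value / 8


if __name__ == "__main__":
    # Illustrative dimensions (hypothetical, not read from any model config).
    dims = dict(layers=28, kv_heads=8, head_dim=128, tokens=32_768)
    # Assumed bit widths; the preset-to-bits mapping is a guess from the names.
    for label, bits in (("16-bit", 16),
                        ("q3_5 (assumed 3.5 bits)", 3.5),
                        ("q2_5 (assumed 2.5 bits)", 2.5)):
        gib = kv_cache_bytes(bits_per_value=bits, **dims) / 2**30
        print(f"{label:>24}: ~{gib:.2f} GiB")
```

The cache grows linearly with the token count, so the saving from a lower bit width scales with context length and with the number of concurrent sequences being served.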
See Also
- pmetal quantize: CLI reference
- Supported Models: compatible architectures