Quantization

PMetal exports GGUF files in modern K-quant formats, legacy GGML formats, dense BF16/F16/F32 formats, and the newer Q1, TQ, MXFP4, and NVFP4 formats. It also supports MLX-format quantized exports with target bits-per-weight allocation.

Quantization Formats

Format	Description	Typical Size Reduction (vs. `f32`)
`f32`	Float32 (no quantization)	1×
`f16`	Float16	2×
`bf16`	BFloat16	2×
`q8_0`	8-bit quantization	~4×
`q8_1`	8-bit with dot-product sum helper	~4×
`q6_k`	6-bit K-quant	~5×
`q5_k_m`, `q5_k_s`	5-bit K-quant medium/small	~6×
`q5_0`, `q5_1`	Legacy 5-bit GGML variants	~6×
`q4_k_m`, `q4_k_s`	4-bit K-quant medium/small	~8×
`q4_0`, `q4_1`	Legacy 4-bit GGML variants	~8×
`q3_k_m`, `q3_k_s`, `q3_k_l`	3-bit K-quant variants	~10×
`q2_k`	2-bit K-quant	~16×
`q1_0`	1-bit sign quantization	very high
`tq1_0`, `tq2_0`	Ternary GGML quantization	very high
`mxfp4`, `nvfp4`	4-bit floating-point block formats	~8×
`dynamic`	Importance-matrix-guided mixed precision	varies
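
The ratios in the table line up with nominal bit widths (32 divided by the bit count). Block formats also store per-block scales, so the on-disk ratio is usually a little lower. The sketch below works this out for two legacy formats whose GGML block layouts are simple: `q8_0` (32 8-bit values plus one 16-bit scale) and `q4_0` (32 4-bit values plus one 16-bit scale). The helper names are illustrative and are not part of PMetal.

```python
# Bits per weight (bpw) for block formats that store one scale per block.
# Helper names are hypothetical; only the block layouts come from GGML.

def block_bpw(weights_per_block: int, bits_per_weight: int, scale_bits: int) -> float:
    """Average bits stored per weight, including the per-block scale."""
    payload_bits = weights_per_block * bits_per_weight
    return (payload_bits + scale_bits) / weights_per_block


def reduction_vs_f32(bpw: float) -> float:
    """Size reduction relative to 32-bit floats."""
    return 32.0 / bpw


q8_0_bpw = block_bpw(32, 8, 16)  # 8.5 bpw
q4_0_bpw = block_bpw(32, 4, 16)  # 4.5 bpw

for name, bpw in [("q8_0", q8_0_bpw), ("q4_0", q4_0_bpw)]:
    print(f"{name}: {bpw} bpw, about {reduction_vs_f32(bpw):.2f}x smaller than f32")
```

Under these layouts, `q8_0` comes out at about 3.76× and `q4_0` at about 7.11×, which is why the table marks its figures as approximate.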

Importance Matrix

Pass `--imatrix` with a calibration dataset so that quantization preserves quality on the most important weights:

pmetal quantize \
  --model ./output \
  --output model.gguf \
  --method q4_k_m \
  --imatrix calibration.jsonl
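
As a rough illustration of the idea behind an importance matrix, the sketch below picks a block scale that minimizes an importance-weighted squared error instead of a plain one. This is a conceptual toy and not PMetal's implementation: the function names, the symmetric 4-bit scheme, and the scale search range are all assumptions, and the format of `calibration.jsonl` is not covered here.

```python
# Toy importance-weighted scale search for one block of weights.
# All names and the search strategy are hypothetical.

def dequantized(weights, scale, bits=4):
    """Round each weight to a signed integer grid and map it back to floats."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax - 1
    return [max(qmin, min(qmax, round(w / scale))) * scale for w in weights]


def weighted_error(weights, importance, scale, bits=4):
    """Squared reconstruction error, weighted per element by importance."""
    approx = dequantized(weights, scale, bits)
    return sum(i * (w - a) ** 2 for w, a, i in zip(weights, approx, importance))


def best_scale(weights, importance, bits=4, steps=64):
    """Try scales from 0.5x to 1.5x of the max-abs scale and keep the best."""
    qmax = 2 ** (bits - 1) - 1
    base = max(abs(w) for w in weights) / qmax
    candidates = [base * (0.5 + k / steps) for k in range(steps + 1)]
    return min(candidates, key=lambda s: weighted_error(weights, importance, s, bits))


weights = [0.02, -0.11, 0.37, -0.05, 0.90, 0.08, -0.42, 0.15]
uniform = [1.0] * len(weights)
# Hypothetical importances: calibration says the small weights matter most.
importance = [9.0, 8.0, 1.0, 9.0, 0.1, 8.0, 1.0, 5.0]

scale_plain = best_scale(weights, uniform)
scale_guided = best_scale(weights, importance)
print("plain scale:", scale_plain, "guided scale:", scale_guided)
```

Because both searches draw from the same candidate set, the guided scale never does worse than the plain one when error is measured with the importance weights.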

KL Calibration and MLX Export

Use KL calibration to choose per-tensor quantization types under a quality threshold or target bits-per-weight budget:

pmetal quantize \
  --model ./output \
  --output model.gguf \
  --kl-calibrate \
  --target-bpw 4.5
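
One way to picture allocation under a bits-per-weight budget is a greedy loop: start every tensor at the smallest candidate type, then repeatedly apply the upgrade that buys the largest KL reduction per extra bit while the average stays within the target. The sketch below assumes that per-tensor KL values have already been measured for each candidate type. The tensor names, KL numbers, candidate list, and greedy strategy are hypothetical and may differ from what PMetal actually does.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two probability distributions over the same tokens."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


# Candidate types ordered from fewest to most bits per weight (illustrative).
LADDER = [("q3_k", 3.4375), ("q4_k", 4.5), ("q6_k", 6.5625)]


def average_bpw(tensors, choice):
    """Parameter-weighted average bits per weight for a given assignment."""
    total = sum(t["params"] for t in tensors.values())
    bits = sum(tensors[name]["params"] * LADDER[idx][1] for name, idx in choice.items())
    return bits / total


def allocate(tensors, target_bpw):
    """Greedy upgrade loop; returns ({tensor: type}, achieved average bpw)."""
    choice = {name: 0 for name in tensors}
    while True:
        best = None
        for name, idx in choice.items():
            if idx + 1 == len(LADDER):
                continue  # already at the largest type
            trial = {**choice, name: idx + 1}
            if average_bpw(tensors, trial) > target_bpw:
                continue  # upgrade would exceed the budget
            t = tensors[name]
            gain = t["kl"][LADDER[idx][0]] - t["kl"][LADDER[idx + 1][0]]
            cost = t["params"] * (LADDER[idx + 1][1] - LADDER[idx][1])
            if best is None or gain / cost > best[0]:
                best = (gain / cost, name)
        if best is None:
            break
        choice[best[1]] += 1
    return {n: LADDER[i][0] for n, i in choice.items()}, average_bpw(tensors, choice)


# Hypothetical measurements: KL against the unquantized model, per candidate type.
TENSORS = {
    "attn.q_proj": {"params": 1_000_000, "kl": {"q3_k": 0.040, "q4_k": 0.012, "q6_k": 0.002}},
    "mlp.down_proj": {"params": 3_000_000, "kl": {"q3_k": 0.015, "q4_k": 0.006, "q6_k": 0.001}},
}

plan, achieved = allocate(TENSORS, target_bpw=4.5)
print(plan, achieved)
```

If the target is below the smallest candidate's bits per weight, this sketch simply returns the smallest type everywhere; a real tool would need to report that the budget is unreachable.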

For MLX-format exports, use `--format mlx` with `--bits` and `--group-size`:

pmetal quantize \
  --model ./output \
  --output ./mlx-quantized \
  --format mlx \
  --bits 4 \
  --group-size 64
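
To show what bits and group size control, here is a minimal sketch of group-wise affine quantization, the general scheme MLX uses: each group of weights gets its own scale and bias, and each weight is stored as a small unsigned code. The function names are illustrative, and the effective bits-per-weight figure assumes a 16-bit scale and a 16-bit bias per group.

```python
# Group-wise affine quantization sketch (pure Python, hypothetical helpers).

def quantize_group(values, bits):
    """Map one group to integer codes in [0, 2**bits - 1] plus scale and bias."""
    lo, hi = min(values), max(values)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [max(0, min(levels, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights from codes."""
    return [c * scale + bias for c in codes]


def effective_bpw(bits, group_size, meta_bits=32):
    """Bits per weight including per-group scale and bias (assumed 16 bits each)."""
    return bits + meta_bits / group_size


group = [0.01 * i - 0.3 for i in range(64)]  # one group of 64 weights
codes, scale, bias = quantize_group(group, bits=4)
restored = dequantize_group(codes, scale, bias)
max_err = max(abs(a - b) for a, b in zip(group, restored))

print("effective bpw:", effective_bpw(4, 64))
print("max reconstruction error:", max_err)
```

Smaller groups track local weight ranges more closely but spend more bits on scales and biases. Under the assumption above, 4 bits with a group size of 64 costs 4.5 bits per weight.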

FP8 Runtime Quantization

To reduce memory at inference time without converting to GGUF, pass `--fp8`:

pmetal infer --model Qwen/Qwen3-4B --fp8 --chat

This converts weights to FP8 (E4M3) at load time, roughly halving weight memory relative to 16-bit weights.
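
For intuition about the precision involved, the sketch below rounds a Python float to the nearest value representable in E4M3 (4 exponent bits, 3 mantissa bits, largest finite value 448). It is a toy model of the number format only. Saturating out-of-range values to 448 is an assumption, and PMetal's actual conversion may differ, for example by applying scaling factors.

```python
import math

E4M3_MAX = 448.0        # largest finite E4M3 magnitude
MIN_NORMAL_EXP = -6     # smallest normal exponent; subnormals reuse it
MANTISSA_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value (saturating, assumed)."""
    if x == 0.0 or math.isnan(x):
        return x
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so the binade exponent is e - 1.
    _, e = math.frexp(mag)
    exponent = max(e - 1, MIN_NORMAL_EXP)
    step = 2.0 ** (exponent - MANTISSA_BITS)  # spacing between neighbours
    return sign * min(round(mag / step) * step, E4M3_MAX)


for value in [1.0, 1.07, -0.3, 1000.0]:
    print(value, "->", round_to_e4m3(value))
```

With only three mantissa bits, neighbouring values within a binade are 12.5% of the binade's lower bound apart, so 1.07 lands on 1.125 and -0.3 lands on -0.3125.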

TurboQuant KV Cache

Weight quantization is separate from KV cache compression. For long-context inference and serving, PMetal also supports TurboQuant KV cache compression:

pmetal infer --model Qwen/Qwen3-0.6B --kv-turboquant-preset q3_5 --chat
pmetal serve --model Qwen/Qwen3-0.6B --kv-turboquant-preset q3_5 --continuous-batch

Use `q3_5` for near-lossless compression, or `q2_5` when reducing memory matters more than fidelity.
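
To get a feel for why KV cache compression matters at long context, the following back-of-the-envelope sketch estimates cache size from model shape and bits per cached value. The model dimensions are hypothetical placeholders. Reading `q3_5` and `q2_5` as roughly 3.5 and 2.5 bits per value is an assumption, and any per-block metadata overhead is ignored.

```python
# Rough KV cache size estimate; all configuration values are hypothetical.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits_per_value):
    """Bytes needed to cache keys and values for one sequence."""
    # Keys and values are both cached, hence the factor of 2.
    values = 2 * layers * kv_heads * head_dim * seq_len
    return values * bits_per_value / 8


config = dict(layers=28, kv_heads=8, head_dim=128, seq_len=32_768)

for label, bits in [("16-bit", 16), ("~3.5-bit", 3.5), ("~2.5-bit", 2.5)]:
    gib = kv_cache_bytes(**config, bits_per_value=bits) / 2 ** 30
    print(f"{label}: {gib:.2f} GiB per sequence")
```

The estimate scales linearly with sequence length and with the number of concurrent sequences, which is why the savings grow in long-context and batched serving scenarios.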