pmetal quantize

Quantize a model to GGUF or MLX format for efficient inference. Supports importance matrices, KL calibration, LoRA fusion before quantization, and target bits-per-weight allocation.

Usage

pmetal quantize \
  --model <MODEL> \
  --output <OUTPUT_FILE> \
  [--method <QUANT_METHOD>] \
  [OPTIONS]

Examples

# 4-bit quantization
pmetal quantize \
  --model ./output \
  --output model.gguf \
  --method q4_k_m

# With importance matrix
pmetal quantize \
  --model ./output \
  --output model.gguf \
  --method dynamic \
  --imatrix calibration.jsonl

# Dynamic per-layer quantization
pmetal quantize \
  --model ./output \
  --output model.gguf \
  --method dynamic

# KL-calibrated quantization (per-tensor type selection)
pmetal quantize \
  --model ./output \
  --output model.gguf \
  --kl-calibrate --target-bpw 4.5

# MLX-format quantized export
pmetal quantize \
  --model ./output \
  --output ./mlx-quantized \
  --format mlx \
  --bits 4 --group-size 64
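
The examples above do not cover LoRA fusion. The following is a minimal sketch that uses only flags listed in the Parameters table. The adapter path `./lora-adapter` and the output name are hypothetical placeholders, and the dry-run guard is an addition so that the snippet prints the command instead of failing on a machine where `pmetal` is not installed.

```shell
# Fuse a LoRA adapter into the base model, then quantize to 4-bit GGUF.
# All three paths are hypothetical placeholders; adjust them to your setup.
MODEL=./output
ADAPTER=./lora-adapter
OUT=model-fused.gguf

# Build the argument list once so it can be run or just printed.
set -- quantize --model "$MODEL" --lora "$ADAPTER" --output "$OUT" --method q4_k_m

if command -v pmetal >/dev/null 2>&1; then
  pmetal "$@"
else
  # pmetal is not on PATH: show what would have been run.
  echo "dry run: pmetal $*"
fi
```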

Quantization Types

Method	Description
`dynamic`	Auto-select per layer
`q8_0`	8-bit quantization
`q8_1`	8-bit with dot-product sum helper
`q6_k`	6-bit K-quant
`q5_k_m`, `q5_k_s`	5-bit K-quant medium/small
`q5_0`, `q5_1`	Legacy 5-bit GGML variants
`q4_k_m`, `q4_k_s`	4-bit K-quant medium/small
`q4_0`, `q4_1`	Legacy 4-bit GGML variants
`q3_k_m`, `q3_k_s`, `q3_k_l`	3-bit K-quant variants
`q2_k`	2-bit K-quant
`q1_0`	1-bit sign quantization
`tq1_0`, `tq2_0`	Ternary quantization
`mxfp4`, `nvfp4`	4-bit floating-point block formats
`bf16`	BFloat16
`f16`	Float16
`f32`	Float32
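
The nominal bit width in a type name is not the full storage cost, because each block also carries scale metadata. As a rough illustration, the arithmetic below assumes the standard GGML block layouts for the legacy types (32 weights per block with a 2-byte f16 scale, plus a 2-byte minimum for the `_1` variants). It is a sketch of that arithmetic, not output from `pmetal`.

```shell
# bits per weight = block_bytes * 8 / weights_per_block
# Assumes 32 weights per block, as in the standard GGML legacy layouts.
bpw() { awk -v bytes="$1" 'BEGIN { printf "%.1f", bytes * 8 / 32 }'; }

Q4_0_BPW=$(bpw 18)   # 2-byte scale + 16 bytes of packed 4-bit values
Q4_1_BPW=$(bpw 20)   # 2-byte scale + 2-byte minimum + 16 bytes of values
Q8_0_BPW=$(bpw 34)   # 2-byte scale + 32 bytes of 8-bit values

echo "q4_0=$Q4_0_BPW q4_1=$Q4_1_BPW q8_0=$Q8_0_BPW"
```

Under these assumptions, `q4_0` costs about 4.5 bits per weight rather than 4, which is worth keeping in mind when comparing a fixed method against a `--target-bpw` value.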

Parameters

Parameter	Default	Description
`--model`	required	Source model path
`--output`	required	Output GGUF file or MLX export directory
`--method`	`dynamic`	GGUF quantization method (see Quantization Types)
`--imatrix`	—	Importance matrix path
`--lora`	—	Fuse a LoRA adapter before quantizing
`--kl-calibrate`	`false`	Select a quantization type per tensor based on a KL-divergence quality threshold
`--target-bpw`	—	Target average bits per weight for calibration
`--kl-threshold`	`0.01`	Quality-loss threshold for calibration
`--format`	`gguf`	`gguf` or `mlx`
`--bits`	`4`	Default bit width for MLX-format quantization
`--group-size`	`64`	MLX-format quantization group size
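
When choosing a value for `--target-bpw`, a back-of-the-envelope size estimate can help: weight storage is roughly parameters × bits per weight ÷ 8 bytes. The sketch below assumes a hypothetical 7-billion-parameter model and ignores file metadata and any tensors kept at higher precision, so treat the result as an approximation.

```shell
# Rough output size for a given average bits-per-weight target.
PARAMS=7000000000   # hypothetical parameter count
BPW=4.5             # value you would pass to --target-bpw

SIZE_GB=$(awk -v p="$PARAMS" -v b="$BPW" \
  'BEGIN { printf "%.2f", p * b / 8 / 1e9 }')

echo "approx ${SIZE_GB} GB of weights at ${BPW} bpw"
```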

See Also

Quantization — Detailed quantization guide