pmetal quantize

Quantize a model to GGUF or MLX format for efficient inference. Supports importance matrices, KL calibration, LoRA fusion before quantization, and target bits-per-weight allocation.

pmetal quantize \
  --model <MODEL> \
  --output <OUTPUT_FILE> \
  [--method <QUANT_METHOD>] \
  [OPTIONS]

# 4-bit quantization
pmetal quantize \
  --model ./output \
  --output model.gguf \
  --method q4_k_m

# With importance matrix
pmetal quantize \
  --model ./output \
  --output model.gguf \
  --method dynamic \
  --imatrix calibration.jsonl

# Dynamic per-layer quantization
pmetal quantize \
  --model ./output \
  --output model.gguf \
  --method dynamic

# KL-calibrated quantization (per-tensor type selection)
pmetal quantize \
  --model ./output \
  --output model.gguf \
  --kl-calibrate --target-bpw 4.5

# MLX-format quantized export
pmetal quantize \
  --model ./output \
  --output ./mlx-quantized \
  --format mlx \
  --bits 4 --group-size 64
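The examples above do not cover LoRA fusion. The following is a minimal sketch, assuming that `--lora` takes a path to the adapter (the flag is documented, but its argument form is not) and using hypothetical paths `./lora-adapter` and `model-fused.gguf`. The command is printed first and executed only when `pmetal` is found on `PATH`.

```shell
# Hypothetical paths; substitute your own base model and adapter.
MODEL=./output
ADAPTER=./lora-adapter      # assumption: --lora accepts an adapter path
OUT=model-fused.gguf

# Build the argument list once so the printed and executed commands match.
set -- quantize --model "$MODEL" --lora "$ADAPTER" --output "$OUT" --method q4_k_m
CMD="pmetal $*"
echo "$CMD"

# Run only when the CLI is actually installed.
if command -v pmetal >/dev/null 2>&1; then
  pmetal "$@"
fi
```

Fusing before quantizing means the adapter weights are merged into the base model first, so the output is a single file with no separate adapter to load at inference time.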
| Method | Description |
| --- | --- |
| dynamic | Auto-select per layer |
| q8_0 | 8-bit quantization |
| q8_1 | 8-bit with dot-product sum helper |
| q6_k | 6-bit K-quant |
| q5_k_m, q5_k_s | 5-bit K-quant medium/small |
| q5_0, q5_1 | Legacy 5-bit GGML variants |
| q4_k_m, q4_k_s | 4-bit K-quant medium/small |
| q4_0, q4_1 | Legacy 4-bit GGML variants |
| q3_k_m, q3_k_s, q3_k_l | 3-bit K-quant variants |
| q2_k | 2-bit K-quant |
| q1_0 | 1-bit sign quantization |
| tq1_0, tq2_0 | Ternary quantization |
| mxfp4, nvfp4 | 4-bit floating-point block formats |
| bf16 | BFloat16 |
| f16 | Float16 |
| f32 | Float32 |

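To compare a few of these methods side by side, one option is to loop over method names and write one file per method. This is only a sketch: the `./quants` output directory and the file naming scheme are hypothetical, and the three methods are an arbitrary selection from the table. When `pmetal` is not installed, the loop prints the commands instead of running them.

```shell
MODEL=./output            # same source path as the examples above
OUTDIR=./quants           # hypothetical output directory
COUNT=0

for method in q8_0 q5_k_m q4_k_m; do
  out="$OUTDIR/model-$method.gguf"
  COUNT=$((COUNT + 1))
  if command -v pmetal >/dev/null 2>&1; then
    mkdir -p "$OUTDIR"
    pmetal quantize --model "$MODEL" --output "$out" --method "$method"
  else
    # Dry run when the CLI is not installed: show what would be executed.
    echo "would run: pmetal quantize --model $MODEL --output $out --method $method"
  fi
done
echo "planned $COUNT quantization runs"
```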
| Parameter | Default | Description |
| --- | --- | --- |
| --model | required | Source model path |
| --output | required | Output GGUF file or MLX export directory |
| --method | dynamic | GGUF quantization method |
| --imatrix | | Importance matrix path |
| --lora | | Fuse a LoRA adapter before quantizing |
| --kl-calibrate | false | Select per-tensor types by KL/quality threshold |
| --target-bpw | | Target average bits per weight for calibration |
| --kl-threshold | 0.01 | Quality-loss threshold for calibration |
| --format | gguf | gguf or mlx |
| --bits | 4 | Default bit width for MLX-format quantization |
| --group-size | 64 | MLX-format quantization group size |
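When choosing a value for `--target-bpw`, a rough size estimate can help: weight storage is approximately the parameter count times the average bits per weight, divided by 8 to get bytes. The sketch below applies that arithmetic to a hypothetical 7-billion-parameter model. The `estimate_gb` helper is illustrative and not part of `pmetal`, and the result ignores file metadata and any tensors kept at higher precision.

```shell
# Estimate quantized weight size (decimal GB) from a parameter count
# and an average bits-per-weight figure.
estimate_gb() {
  # $1 = parameter count, $2 = average bits per weight
  LC_ALL=C awk -v n="$1" -v bpw="$2" 'BEGIN { printf "%.2f\n", n * bpw / 8 / 1e9 }'
}

PARAMS=7000000000                      # hypothetical 7B-parameter model
SIZE_GB=$(estimate_gb "$PARAMS" 4.5)   # matches --target-bpw 4.5
echo "~${SIZE_GB} GB at 4.5 bpw"
```

For this example the estimate comes to about 3.94 GB, compared with roughly 14 GB for the same model at 16 bits per weight.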