Quantize a model to GGUF or MLX format for efficient inference. Supports importance matrices, KL calibration, LoRA fusion before quantization, and target bits-per-weight allocation.
[--method <QUANT_METHOD>] \
--imatrix calibration.jsonl
# Dynamic per-layer quantization
# KL-calibrated quantization (per-tensor type selection)
--kl-calibrate --target-bpw 4.5
# MLX-format quantized export
--output ./mlx-quantized \
| Format | Description |
| --- | --- |
| dynamic | Auto-select per layer |
| q8_0 | 8-bit quantization |
| q8_1 | 8-bit with dot-product sum helper |
| q6_k | 6-bit K-quant |
| q5_k_m, q5_k_s | 5-bit K-quant medium/small |
| q5_0, q5_1 | Legacy 5-bit GGML variants |
| q4_k_m, q4_k_s | 4-bit K-quant medium/small |
| q4_0, q4_1 | Legacy 4-bit GGML variants |
| q3_k_m, q3_k_s, q3_k_l | 3-bit K-quant variants |
| q2_k | 2-bit K-quant |
| q1_0 | 1-bit sign quantization |
| tq1_0, tq2_0 | Ternary quantization |
| mxfp4, nvfp4 | 4-bit floating-point block formats |
| bf16 | BFloat16 |
| f16 | Float16 |
| f32 | Float32 |
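
To relate the format names above to a `--target-bpw` value, it can help to estimate how many bits each weight actually costs once per-block metadata is counted. The sketch below is illustrative only: it assumes the legacy block layouts used by GGML in llama.cpp (32 weights per block, fp16 scale and optional fp16 minimum), which this tool may or may not follow exactly. The names `BLOCK_LAYOUTS`, `bits_per_weight`, and `file_size_gib` are hypothetical helpers, not part of the tool.

```python
# Assumption: GGML legacy block layouts as used in llama.cpp.
# Each entry maps a format name to (weights per block, bytes per block).
BLOCK_LAYOUTS = {
    "q4_0": (32, 2 + 16),          # fp16 scale + 32 packed 4-bit values
    "q4_1": (32, 2 + 2 + 16),      # fp16 scale + fp16 min + 32 packed 4-bit values
    "q5_0": (32, 2 + 4 + 16),      # fp16 scale + 32 high bits + 32 packed 4-bit values
    "q5_1": (32, 2 + 2 + 4 + 16),  # as q5_0, plus an fp16 min
    "q8_0": (32, 2 + 32),          # fp16 scale + 32 int8 values
}


def bits_per_weight(fmt: str) -> float:
    """Average storage cost of one weight, including block metadata."""
    weights, nbytes = BLOCK_LAYOUTS[fmt]
    return nbytes * 8 / weights


def file_size_gib(n_params: float, fmt: str) -> float:
    """Rough tensor-data size in GiB if every weight used `fmt`."""
    return n_params * bits_per_weight(fmt) / 8 / 2**30


if __name__ == "__main__":
    for fmt in BLOCK_LAYOUTS:
        print(f"{fmt}: {bits_per_weight(fmt):.2f} bpw")
```

Under these assumptions `q4_0` costs 4.5 bits per weight rather than 4, which is why a target such as 4.5 bpw is a natural budget for a mostly 4-bit model.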
| Parameter | Default | Description |
| --- | --- | --- |
| --model | required | Source model path |
| --output | required | Output GGUF file or MLX export directory |
| --method | dynamic | GGUF quantization method |
| --imatrix | — | Importance matrix path |
| --lora | — | Fuse a LoRA adapter before quantizing |
| --kl-calibrate | false | Select per-tensor types by KL/quality threshold |
| --target-bpw | — | Target average bits per weight for calibration |
| --kl-threshold | 0.01 | Quality-loss threshold for calibration |
| --format | gguf | gguf or mlx |
| --bits | 4 | Default bit width for MLX-format quantization |
| --group-size | 64 | MLX-format quantization group size |
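
As a rough illustration of what `--bits` and `--group-size` control, the following sketch shows group-wise affine quantization in plain Python: every group of weights shares one scale and one offset, and each weight is stored as a small integer code. This is a minimal sketch of the general technique, not the tool's implementation. The function names are hypothetical, and the per-group overhead of 32 bits (one fp16 scale plus one fp16 offset) is an assumption.

```python
def quantize_group(values, bits=4):
    """Affine-quantize one group to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    # A constant group has zero range; fall back to scale 1.0 to avoid dividing by zero.
    scale = (hi - lo) / levels or 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Reconstruct approximate weights from codes, scale, and offset."""
    return [c * scale + lo for c in codes]


def quantize(weights, bits=4, group_size=64):
    """Split a flat weight list into groups and quantize each independently."""
    if len(weights) % group_size:
        raise ValueError("weight count must be a multiple of group_size")
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]


def effective_bpw(bits=4, group_size=64, overhead_bits=32):
    """Bits per weight including assumed per-group scale and offset storage."""
    return bits + overhead_bits / group_size
```

With the defaults from the table (`bits=4`, `group_size=64`) and the assumed overhead, `effective_bpw()` gives 4.5. A smaller group size tracks local weight ranges more closely at the cost of more metadata per weight.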