# Knowledge Distillation

PMetal supports multiple distillation methods and loss functions for compressing large models into smaller, deployable ones.

Online distillation runs live teacher inference during training. It gives the highest quality but is the slowest mode, since both the teacher and the student must fit in memory at the same time.

```sh
pmetal distill \
  --teacher Qwen/Qwen3-4B \
  --student unsloth/Qwen3.5-0.8B-Base \
  --dataset train.jsonl
```
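Conceptually, online mode recomputes the teacher's distribution for every batch and matches the student to it. A minimal sketch of the per-token loss in plain Python (this illustrates the idea, not PMetal's internals; the temperature value is an illustrative choice):

```python
import math

def softmax(logits, temperature=1.0):
    # temperature-scaled softmax over one token position
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_distill_loss(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 (as in Hinton et al.) so gradient magnitudes are preserved
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

In online mode the teacher forward pass that produces `teacher_logits` runs at every step, which is why both models must be resident in memory.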

Offline distillation pre-caches teacher logits to disk with compression, trading disk space for faster training.

```sh
pmetal distill --teacher Qwen/Qwen3-4B --student Qwen/Qwen3-0.6B --dataset train.jsonl --offline
```
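To see why caching with compression is practical, note that only a few logits per position carry most of the teacher's probability mass. A sketch of a sparse top-k cache (the file format, `k`, and gzip choice are illustrative assumptions, not PMetal's actual cache layout):

```python
import gzip
import json

def top_k_logits(logits, k=4):
    # keep only the k largest logits per position, stored sparsely as {index: logit}
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return {i: logits[i] for i in idx}

def cache_teacher_logits(path, sequences, k=4):
    # one gzip-compressed JSON record per sequence:
    # a list of {token_index: logit} dicts, one per position
    with gzip.open(path, "wt") as f:
        for seq_logits in sequences:
            record = [top_k_logits(pos, k) for pos in seq_logits]
            f.write(json.dumps(record) + "\n")
```

Training then streams these records instead of running the teacher, so only the student needs to be in memory.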

Gradually increase task difficulty during distillation for curriculum-style knowledge transfer.
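One common way to realize this is to order samples by a difficulty proxy and release progressively harder stages as training advances. A sketch using sample length as the proxy (both the proxy and the linear staging are illustrative assumptions, not PMetal's scheduler):

```python
def curriculum_stages(samples, num_stages=3, difficulty=len):
    # sort by a difficulty proxy (here: sample length), then yield
    # progressively larger prefixes so later stages add harder samples
    ordered = sorted(samples, key=difficulty)
    stage_size = max(1, len(ordered) // num_stages)
    for stage in range(num_stages):
        # the final stage always covers the full dataset
        end = len(ordered) if stage == num_stages - 1 else (stage + 1) * stage_size
        yield ordered[:end]
```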

## TAID (Temporally Adaptive Interpolated Distillation)

A state-of-the-art method from ICLR 2025, available via `TaidDistiller` in the library API. TAID distills against a time-dependent intermediate distribution that gradually shifts from the student toward the teacher as training progresses.
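The core idea can be sketched as an interpolated target distribution (a conceptual illustration of the paper's idea with a simple linear mixing schedule, not the internals of `TaidDistiller`, which adapts the schedule during training):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def taid_target(student_logits, teacher_logits, t):
    # intermediate teacher: mix student and teacher distributions with
    # weight t in [0, 1]; early in training (small t) the target stays
    # close to the student, so the capacity gap is bridged gradually
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    return [(1 - t) * ps + t * pt for ps, pt in zip(p_student, p_teacher)]
```

The student is then trained to match `taid_target(...)` with a divergence loss while `t` increases over the run.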

Distill between models with different tokenizers. PMetal handles vocabulary alignment automatically.
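To make the problem concrete, here is one simple alignment strategy: map teacher token ids to student token ids wherever the token strings match exactly, then project the teacher's probability mass onto the student vocabulary. This is a deliberately simplified sketch of the general idea, not necessarily the alignment algorithm PMetal uses:

```python
def exact_match_vocab_map(teacher_vocab, student_vocab):
    # both vocabs map token string -> token id;
    # keep only tokens spelled identically in both vocabularies
    return {
        t_id: student_vocab[tok]
        for tok, t_id in teacher_vocab.items()
        if tok in student_vocab
    }

def project_teacher_probs(teacher_probs, vocab_map, student_vocab_size):
    # move teacher probability mass onto the student vocabulary,
    # renormalizing over the mass that survived the mapping
    projected = [0.0] * student_vocab_size
    for t_id, s_id in vocab_map.items():
        projected[s_id] += teacher_probs[t_id]
    total = sum(projected)
    return [p / total for p in projected] if total > 0 else projected
```

Real cross-tokenizer distillation also has to handle tokens that split differently between the two vocabularies, which exact matching alone cannot cover.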

**Logit-based losses:**

| Loss | Description |
| --- | --- |
| KL Divergence | Standard distribution matching |
| Jensen-Shannon | Symmetric divergence |
| Soft Cross-Entropy | Temperature-scaled cross-entropy |
| TVD | Total variation distance |
| Hinge Ranking | Rank-based loss |
| Logistic Ranking | Logistic rank loss |
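The first four of these losses compare two probability vectors directly; minimal reference implementations in plain Python (illustrative definitions, not PMetal's code, which operates on batched tensors):

```python
import math

def kl_divergence(p, q):
    # standard KL(p || q) between two probability vectors
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    # symmetric: average KL to the midpoint distribution
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def total_variation(p, q):
    # TVD: half the L1 distance between the distributions
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def soft_cross_entropy(p, q):
    # cross-entropy of student q against soft teacher targets p
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)
```

The ranking losses differ in kind: they penalize the student for ordering candidate tokens differently from the teacher rather than for mismatched probabilities.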
**Hidden-state (feature) losses:**

| Loss | Description |
| --- | --- |
| MSE | Mean squared error on hidden states |
| Cosine | Cosine similarity matching |
| L1 | L1 distance |
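These feature losses compare intermediate hidden-state vectors rather than output distributions. Minimal sketches over plain vectors (illustrative only; in practice teacher and student hidden sizes usually differ, so a learned projection aligns them first, which is omitted here):

```python
import math

def mse_loss(h_student, h_teacher):
    # mean squared error between hidden-state vectors
    n = len(h_student)
    return sum((a - b) ** 2 for a, b in zip(h_student, h_teacher)) / n

def l1_loss(h_student, h_teacher):
    # mean absolute (L1) distance
    n = len(h_student)
    return sum(abs(a - b) for a, b in zip(h_student, h_teacher)) / n

def cosine_loss(h_student, h_teacher):
    # 1 - cosine similarity, so vectors pointing the same way give zero loss
    dot = sum(a * b for a, b in zip(h_student, h_teacher))
    na = math.sqrt(sum(a * a for a in h_student))
    nb = math.sqrt(sum(b * b for b in h_teacher))
    return 1.0 - dot / (na * nb)
```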

For reasoning models, PMetal supports rationale distillation, which preserves the teacher's thinking process rather than just its final answer.
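The difference from standard distillation is in how the training target is built: the teacher's reasoning trace is kept in the sequence the student learns to emit. A sketch (the `<think>` markers are an illustrative convention; the actual delimiters depend on the model family, and this is not PMetal's data format):

```python
def distill_target(rationale, answer, keep_rationale=True):
    # with keep_rationale=True the student is trained to produce the
    # teacher's reasoning trace before the answer; otherwise only the
    # final answer is supervised, as in standard distillation
    if keep_rationale:
        return f"<think>{rationale}</think>\n{answer}"
    return answer
```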