# Knowledge Distillation

PMetal supports multiple distillation methods and loss functions for compressing large models into smaller, deployable ones.

Online distillation runs live teacher inference during training. It gives the highest quality but is the slowest mode, since both the teacher and the student must fit in memory at the same time.

```sh
pmetal distill \
  --teacher Qwen/Qwen3-4B \
  --student unsloth/Qwen3.5-0.8B-Base \
  --dataset train.jsonl
```
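Conceptually, online mode recomputes the teacher's distribution for every batch and matches the student to it. A minimal sketch of the per-token loss in plain Python (this illustrates the idea, not PMetal's internals; the temperature value is an illustrative choice):

```python
import math

def softmax(logits, temperature=1.0):
    # temperature-scaled softmax over one token position
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_distill_loss(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 (as in Hinton et al.) so gradient magnitudes are preserved
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

In online mode the teacher forward pass that produces `teacher_logits` runs at every step, which is why both models must be resident in memory.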

Offline distillation pre-caches teacher logits to disk with compression, trading disk space for faster training.

```sh
pmetal distill --teacher Qwen/Qwen3-4B --student Qwen/Qwen3-0.6B --dataset train.jsonl --offline
```
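To see why caching with compression is practical, note that only a few logits per position carry most of the teacher's probability mass. A sketch of a sparse top-k cache (the file format, `k`, and gzip choice are illustrative assumptions, not PMetal's actual cache layout):

```python
import gzip
import json

def top_k_logits(logits, k=4):
    # keep only the k largest logits per position, stored sparsely as {index: logit}
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return {i: logits[i] for i in idx}

def cache_teacher_logits(path, sequences, k=4):
    # one gzip-compressed JSON record per sequence:
    # a list of {token_index: logit} dicts, one per position
    with gzip.open(path, "wt") as f:
        for seq_logits in sequences:
            record = [top_k_logits(pos, k) for pos in seq_logits]
            f.write(json.dumps(record) + "\n")
```

Training then streams these records instead of running the teacher, so only the student needs to be in memory.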

Gradually increase task difficulty during distillation for curriculum-style knowledge transfer.
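One common way to realize this is to order samples by a difficulty proxy and release progressively harder stages as training advances. A sketch using sample length as the proxy (both the proxy and the linear staging are illustrative assumptions, not PMetal's scheduler):

```python
def curriculum_stages(samples, num_stages=3, difficulty=len):
    # sort by a difficulty proxy (here: sample length), then yield
    # progressively larger prefixes so later stages add harder samples
    ordered = sorted(samples, key=difficulty)
    stage_size = max(1, len(ordered) // num_stages)
    for stage in range(num_stages):
        # the final stage always covers the full dataset
        end = len(ordered) if stage == num_stages - 1 else (stage + 1) * stage_size
        yield ordered[:end]
```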

## TAID (Temporally Adaptive Interpolated Distillation)

A state-of-the-art method from ICLR 2025, available via `TaidDistiller` in the library API. TAID distills against a time-dependent intermediate distribution that gradually shifts from the student toward the teacher as training progresses.
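The core idea can be sketched as an interpolated target distribution (a conceptual illustration of the paper's idea with a simple linear mixing schedule, not the internals of `TaidDistiller`, which adapts the schedule during training):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def taid_target(student_logits, teacher_logits, t):
    # intermediate teacher: mix student and teacher distributions with
    # weight t in [0, 1]; early in training (small t) the target stays
    # close to the student, so the capacity gap is bridged gradually
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    return [(1 - t) * ps + t * pt for ps, pt in zip(p_student, p_teacher)]
```

The student is then trained to match `taid_target(...)` with a divergence loss while `t` increases over the run.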

Distill between models with different tokenizers. PMetal handles vocabulary alignment automatically.
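To make the problem concrete, here is one simple alignment strategy: map teacher token ids to student token ids wherever the token strings match exactly, then project the teacher's probability mass onto the student vocabulary. This is a deliberately simplified sketch of the general idea, not necessarily the alignment algorithm PMetal uses:

```python
def exact_match_vocab_map(teacher_vocab, student_vocab):
    # both vocabs map token string -> token id;
    # keep only tokens spelled identically in both vocabularies
    return {
        t_id: student_vocab[tok]
        for tok, t_id in teacher_vocab.items()
        if tok in student_vocab
    }

def project_teacher_probs(teacher_probs, vocab_map, student_vocab_size):
    # move teacher probability mass onto the student vocabulary,
    # renormalizing over the mass that survived the mapping
    projected = [0.0] * student_vocab_size
    for t_id, s_id in vocab_map.items():
        projected[s_id] += teacher_probs[t_id]
    total = sum(projected)
    return [p / total for p in projected] if total > 0 else projected
```

Real cross-tokenizer distillation also has to handle tokens that split differently between the two vocabularies, which exact matching alone cannot cover.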

**Logit-based losses:**

| Loss | Description |
| --- | --- |
| KL Divergence | Standard distribution matching |
| Jensen-Shannon | Symmetric divergence |
| Soft Cross-Entropy | Temperature-scaled cross-entropy |
| TVD | Total variation distance |
| Hinge Ranking | Rank-based loss |
| Logistic Ranking | Logistic rank loss |
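The first four of these losses compare two probability vectors directly; minimal reference implementations in plain Python (illustrative definitions, not PMetal's code, which operates on batched tensors):

```python
import math

def kl_divergence(p, q):
    # standard KL(p || q) between two probability vectors
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    # symmetric: average KL to the midpoint distribution
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def total_variation(p, q):
    # TVD: half the L1 distance between the distributions
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def soft_cross_entropy(p, q):
    # cross-entropy of student q against soft teacher targets p
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)
```

The ranking losses differ in kind: they penalize the student for ordering candidate tokens differently from the teacher rather than for mismatched probabilities.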
**Hidden-state (feature) losses:**

| Loss | Description |
| --- | --- |
| MSE | Mean squared error on hidden states |
| Cosine | Cosine similarity matching |
| L1 | L1 distance |
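These feature losses compare intermediate hidden-state vectors rather than output distributions. Minimal sketches over plain vectors (illustrative only; in practice teacher and student hidden sizes usually differ, so a learned projection aligns them first, which is omitted here):

```python
import math

def mse_loss(h_student, h_teacher):
    # mean squared error between hidden-state vectors
    n = len(h_student)
    return sum((a - b) ** 2 for a, b in zip(h_student, h_teacher)) / n

def l1_loss(h_student, h_teacher):
    # mean absolute (L1) distance
    n = len(h_student)
    return sum(abs(a - b) for a, b in zip(h_student, h_teacher)) / n

def cosine_loss(h_student, h_teacher):
    # 1 - cosine similarity, so vectors pointing the same way give zero loss
    dot = sum(a * b for a, b in zip(h_student, h_teacher))
    na = math.sqrt(sum(a * a for a in h_student))
    nb = math.sqrt(sum(b * b for b in h_teacher))
    return 1.0 - dot / (na * nb)
```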

For reasoning models, PMetal supports rationale distillation, which preserves the teacher's thinking process rather than just its final answer.
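The difference from standard distillation is in how the training target is built: the teacher's reasoning trace is kept in the sequence the student learns to emit. A sketch (the `<think>` markers are an illustrative convention; the actual delimiters depend on the model family, and this is not PMetal's data format):

```python
def distill_target(rationale, answer, keep_rationale=True):
    # with keep_rationale=True the student is trained to produce the
    # teacher's reasoning trace before the answer; otherwise only the
    # final answer is supervised, as in standard distillation
    if keep_rationale:
        return f"<think>{rationale}</think>\n{answer}"
    return answer
```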