
pmetal distill

Distill knowledge from a larger teacher model into a smaller student model. Supports online, offline, progressive, and cross-vocabulary methods.

```sh
pmetal distill \
  --teacher <TEACHER_MODEL> \
  --student <STUDENT_MODEL> \
  --dataset <DATASET> \
  [OPTIONS]
```
```sh
# Online distillation (live teacher inference)
pmetal distill \
  --teacher Qwen/Qwen3-4B \
  --student unsloth/Qwen3.5-0.8B-Base \
  --dataset train.jsonl

# Offline distillation (cached logits)
pmetal distill \
  --teacher Qwen/Qwen3-4B \
  --student unsloth/Qwen3.5-0.8B-Base \
  --dataset train.jsonl \
  --offline

# Cross-vocabulary distillation
pmetal distill \
  --teacher meta-llama/Llama-3.3-70B \
  --student Qwen/Qwen3-0.6B \
  --dataset train.jsonl \
  --cross-vocab
```
| Method | Description |
| --- | --- |
| Online | Live teacher inference during training — highest quality, slowest |
| Offline | Pre-cached logits with compression — faster, uses more disk |
| Progressive | Gradually increases distillation difficulty |
| Cross-vocabulary | Distills between models with different tokenizers |
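The offline method's "pre-cached logits with compression" can be illustrated with a small sketch. pmetal's actual cache format is not documented here; the helper names below (`cache_topk_logits`, `restore_dense`) and the top-k compression scheme are assumptions chosen to show the trade-off the table describes: caching avoids live teacher inference, at the cost of disk space.

```python
import numpy as np

def cache_topk_logits(teacher_logits: np.ndarray, k: int = 16):
    """Keep only the top-k logits per position -- one simple form of the
    compression offline distillation can use to bound disk usage.
    Illustrative only; not pmetal's actual cache format."""
    # teacher_logits: (seq_len, vocab_size)
    idx = np.argsort(teacher_logits, axis=-1)[:, -k:]        # top-k indices
    vals = np.take_along_axis(teacher_logits, idx, axis=-1)  # their logit values
    return idx.astype(np.int32), vals.astype(np.float16)     # compact dtypes

def restore_dense(idx: np.ndarray, vals: np.ndarray, vocab_size: int,
                  fill: float = -1e4) -> np.ndarray:
    """Re-expand a compressed cache to a dense (seq_len, vocab_size) array,
    filling the pruned entries with a large negative logit."""
    dense = np.full((idx.shape[0], vocab_size), fill, dtype=np.float32)
    np.put_along_axis(dense, idx.astype(np.int64),
                      vals.astype(np.float32), axis=-1)
    return dense
```

Storing 16 indices and float16 values per position instead of a full float32 vocabulary row cuts the cache size by orders of magnitude for large vocabularies, while preserving essentially all of the teacher's probability mass.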
Supported loss functions:

- KL Divergence — Standard distribution matching
- Jensen-Shannon — Symmetric divergence
- Soft Cross-Entropy — Temperature-scaled cross-entropy
- TVD — Total variation distance
- Hinge Ranking — Rank-based loss
- Logistic Ranking — Logistic rank loss
- MSE — Mean squared error on hidden states
- Cosine — Cosine similarity matching
- L1 — L1 distance
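As a reference point for the list above, here is a minimal sketch of the standard temperature-scaled KL divergence loss. The function name and the `T**2` gradient-scaling convention are assumptions (the latter follows the common Hinton-style formulation); pmetal's exact implementation may differ.

```python
import numpy as np

def log_softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def kl_distill_loss(teacher_logits: np.ndarray,
                    student_logits: np.ndarray,
                    T: float = 2.0) -> float:
    """KL(teacher || student) on temperature-softened distributions,
    averaged over positions. The T**2 factor keeps gradient magnitudes
    comparable across temperatures. A sketch, not pmetal's exact code."""
    t_logp = log_softmax(teacher_logits / T)
    s_logp = log_softmax(student_logits / T)
    p_t = np.exp(t_logp)                         # teacher probabilities
    kl = (p_t * (t_logp - s_logp)).sum(axis=-1)  # per-position KL
    return float((T ** 2) * kl.mean())
```

The loss is zero when student and teacher logits induce identical distributions, and strictly positive otherwise; higher temperatures flatten both distributions so the student learns more from the teacher's non-argmax "dark knowledge".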