# pmetal distill
Distill knowledge from a larger teacher model into a smaller student model. Supports online, offline, progressive, and cross-vocabulary methods.
```shell
pmetal distill \
  --teacher <TEACHER_MODEL> \
  --student <STUDENT_MODEL> \
  --dataset <DATASET> \
  [OPTIONS]
```

## Examples
```shell
# Online distillation (live teacher inference)
pmetal distill \
  --teacher Qwen/Qwen3-4B \
  --student unsloth/Qwen3.5-0.8B-Base \
  --dataset train.jsonl
```
```shell
# Offline distillation (cached logits)
pmetal distill \
  --teacher Qwen/Qwen3-4B \
  --student unsloth/Qwen3.5-0.8B-Base \
  --dataset train.jsonl \
  --offline
```
```shell
# Cross-vocabulary distillation
pmetal distill \
  --teacher meta-llama/Llama-3.3-70B \
  --student Qwen/Qwen3-0.6B \
  --dataset train.jsonl \
  --cross-vocab
```

## Methods
| Method | Description |
|---|---|
| Online | Live teacher inference during training; highest quality, slowest |
| Offline | Pre-cached logits with compression; faster, but uses more disk |
| Progressive | Gradually increases distillation difficulty over training |
| Cross-vocabulary | Distills between models with different tokenizers |
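To make the online/offline tradeoff concrete, here is a toy sketch in plain Python (the helper names are hypothetical, not pmetal's internals): the offline path runs the teacher once, compresses each token's distribution to its top-k logits, and replays the cache during training instead of calling the teacher live.

```python
import heapq
import json

def top_k_logits(logits, k=2):
    # Keep only the k largest teacher logits per token, a common
    # compression trick for offline (cached) distillation.
    return heapq.nlargest(k, enumerate(logits), key=lambda iv: iv[1])

def cache_teacher_logits(batch_logits, path, k=2):
    # "Offline" phase: run the teacher once, store sparse logits as JSONL.
    with open(path, "w") as f:
        for logits in batch_logits:
            f.write(json.dumps(top_k_logits(logits, k)) + "\n")

def load_cached_logits(path):
    # Training phase: replay cached logits instead of live teacher inference.
    with open(path) as f:
        return [json.loads(line) for line in f]
```

The disk-versus-quality tradeoff in the table comes from `k`: a small `k` shrinks the cache but discards most of the teacher's distribution, while online distillation always sees the full logits.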
## Loss Functions

### Token-Level
Section titled “Token-Level”- KL Divergence — Standard distribution matching
- Jensen-Shannon — Symmetric divergence
- Soft Cross-Entropy — Temperature-scaled cross-entropy
- TVD — Total variation distance
- Hinge Ranking — Rank-based loss
- Logistic Ranking — Logistic rank loss
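As an illustration of the first three token-level losses (a minimal sketch over a single token's vocabulary distribution, not pmetal's implementation), temperature-softened teacher and student distributions are compared directly:

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over one token's vocabulary logits.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) on softened distributions: the standard
    # distribution-matching distillation loss.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(teacher_logits, student_logits, temperature=2.0):
    # Symmetric divergence: average KL of each side against the mixture.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

def soft_cross_entropy(teacher_logits, student_logits, temperature=2.0):
    # Cross-entropy of the student against the softened teacher targets.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)
```

Note that soft cross-entropy equals the KL divergence plus the (constant) entropy of the teacher distribution, so both yield the same gradients for the student.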
### Hidden State

- MSE — Mean squared error on hidden states
- Cosine — Cosine similarity matching
- L1 — L1 distance
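The three hidden-state losses reduce to elementwise comparisons between a teacher and a student hidden vector at matched positions. A minimal sketch over plain lists (illustrative only; real implementations operate on batched tensors and may project the student to the teacher's hidden size first):

```python
import math

def mse_loss(h_teacher, h_student):
    # Mean squared error between aligned hidden-state vectors.
    n = len(h_teacher)
    return sum((a - b) ** 2 for a, b in zip(h_teacher, h_student)) / n

def cosine_loss(h_teacher, h_student):
    # 1 - cosine similarity: zero when the directions match exactly.
    dot = sum(a * b for a, b in zip(h_teacher, h_student))
    norm_t = math.sqrt(sum(a * a for a in h_teacher))
    norm_s = math.sqrt(sum(b * b for b in h_student))
    return 1.0 - dot / (norm_t * norm_s)

def l1_loss(h_teacher, h_student):
    # Mean absolute (L1) distance between the two vectors.
    n = len(h_teacher)
    return sum(abs(a - b) for a, b in zip(h_teacher, h_student)) / n
```

Cosine matching ignores vector magnitude, which is useful when teacher and student hidden states live on different scales; MSE and L1 penalize magnitude differences as well.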
## See Also

- Training Overview — All training methods
- TAID Distillation — Advanced TAID method (library only)