# pmetal distill
Distill knowledge from a larger teacher model into a smaller student model. Supports online, offline, progressive, and cross-vocabulary methods.
```shell
pmetal distill \
  --teacher <TEACHER_MODEL> \
  --student <STUDENT_MODEL> \
  --dataset <DATASET> \
  [OPTIONS]
```

## Examples
```shell
# Online distillation (live teacher inference)
pmetal distill \
  --teacher Qwen/Qwen3-4B \
  --student unsloth/Qwen3.5-0.8B-Base \
  --dataset train.jsonl
```
```shell
# Offline distillation (cached logits)
pmetal distill \
  --teacher Qwen/Qwen3-4B \
  --student unsloth/Qwen3.5-0.8B-Base \
  --dataset train.jsonl \
  --offline
```
```shell
# Cross-vocabulary distillation
pmetal distill \
  --teacher meta-llama/Llama-3.3-70B \
  --student Qwen/Qwen3-0.6B \
  --dataset train.jsonl \
  --cross-vocab
```

## Methods
| Method | Description |
|---|---|
| Online | Live teacher inference during training; highest quality, slowest |
| Offline | Pre-cached logits with compression; faster, but uses more disk |
| Progressive | Gradually increases distillation difficulty over training |
| Cross-vocabulary | Distills between models with different tokenizers |
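To make the online/offline tradeoff concrete, here is a toy sketch in plain Python (the helper names are hypothetical, not pmetal's internals): the offline path runs the teacher once, compresses each token's distribution to its top-k logits, and replays the cache during training instead of calling the teacher live.

```python
import heapq
import json

def top_k_logits(logits, k=2):
    # Keep only the k largest teacher logits per token, a common
    # compression trick for offline (cached) distillation.
    return heapq.nlargest(k, enumerate(logits), key=lambda iv: iv[1])

def cache_teacher_logits(batch_logits, path, k=2):
    # "Offline" phase: run the teacher once, store sparse logits as JSONL.
    with open(path, "w") as f:
        for logits in batch_logits:
            f.write(json.dumps(top_k_logits(logits, k)) + "\n")

def load_cached_logits(path):
    # Training phase: replay cached logits instead of live teacher inference.
    with open(path) as f:
        return [json.loads(line) for line in f]
```

The disk-versus-quality tradeoff in the table comes from `k`: a small `k` shrinks the cache but discards most of the teacher's distribution, while online distillation always sees the full logits.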
## Loss Functions

### Token-Level
Section titled “Token-Level”- KL Divergence — Standard distribution matching
- Jensen-Shannon — Symmetric divergence
- Soft Cross-Entropy — Temperature-scaled cross-entropy
- TVD — Total variation distance
- Hinge Ranking — Rank-based loss
- Logistic Ranking — Logistic rank loss
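As an illustration of the first three token-level losses (a minimal sketch over a single token's vocabulary distribution, not pmetal's implementation), temperature-softened teacher and student distributions are compared directly:

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over one token's vocabulary logits.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) on softened distributions: the standard
    # distribution-matching distillation loss.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(teacher_logits, student_logits, temperature=2.0):
    # Symmetric divergence: average KL of each side against the mixture.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

def soft_cross_entropy(teacher_logits, student_logits, temperature=2.0):
    # Cross-entropy of the student against the softened teacher targets.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)
```

Note that soft cross-entropy equals the KL divergence plus the (constant) entropy of the teacher distribution, so both yield the same gradients for the student.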
### Hidden State

- MSE — Mean squared error on hidden states
- Cosine — Cosine similarity matching
- L1 — L1 distance
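The three hidden-state losses reduce to elementwise comparisons between a teacher and a student hidden vector at matched positions. A minimal sketch over plain lists (illustrative only; real implementations operate on batched tensors and may project the student to the teacher's hidden size first):

```python
import math

def mse_loss(h_teacher, h_student):
    # Mean squared error between aligned hidden-state vectors.
    n = len(h_teacher)
    return sum((a - b) ** 2 for a, b in zip(h_teacher, h_student)) / n

def cosine_loss(h_teacher, h_student):
    # 1 - cosine similarity: zero when the directions match exactly.
    dot = sum(a * b for a, b in zip(h_teacher, h_student))
    norm_t = math.sqrt(sum(a * a for a in h_teacher))
    norm_s = math.sqrt(sum(b * b for b in h_student))
    return 1.0 - dot / (norm_t * norm_s)

def l1_loss(h_teacher, h_student):
    # Mean absolute (L1) distance between the two vectors.
    n = len(h_teacher)
    return sum(abs(a - b) for a, b in zip(h_teacher, h_student)) / n
```

Cosine matching ignores vector magnitude, which is useful when teacher and student hidden states live on different scales; MSE and L1 penalize magnitude differences as well.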
## See Also

- Training Overview — All training methods
- TAID Distillation — Advanced TAID method (library only)