
pmetal rlkd

Reinforcement Learning with Knowledge Distillation — combines GRPO with teacher distillation.

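RLKD trains the student policy with GRPO policy-gradient updates while distilling from a frozen teacher model. The total loss is L = (1 - alpha) * L_grpo + alpha * L_distill, where alpha is the distillation weight set by `--distill-alpha`: 0 gives pure RL and 1 gives pure distillation.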
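
The weighting in the loss formula can be checked with a few lines of arithmetic. The sketch below is not pmetal code: `rlkd_loss` is a hypothetical helper, and the two loss values are made-up numbers chosen only to show how alpha shifts the balance between the GRPO term and the distillation term.

```python
def rlkd_loss(l_grpo: float, l_distill: float, alpha: float) -> float:
    """Blend a GRPO loss and a distillation loss: (1 - alpha) * L_grpo + alpha * L_distill."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * l_grpo + alpha * l_distill


# Hypothetical per-batch loss values, for illustration only.
l_grpo, l_distill = 0.8, 2.0

default_mix = rlkd_loss(l_grpo, l_distill, alpha=0.3)   # the default --distill-alpha
pure_rl = rlkd_loss(l_grpo, l_distill, alpha=0.0)       # only the GRPO term remains
pure_distill = rlkd_loss(l_grpo, l_distill, alpha=1.0)  # only the distillation term remains

print(default_mix, pure_rl, pure_distill)
```

With the default alpha of 0.3, the GRPO term carries 70% of the weight, so the RL signal dominates unless the distillation loss is much larger in magnitude.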

Usage

pmetal rlkd \
  --model <STUDENT_MODEL> \
  --teacher-model <TEACHER_MODEL> \
  --dataset <DATASET> \
  --output <OUTPUT_DIR> \
  [OPTIONS]

Examples

# Basic RLKD with teacher distillation
pmetal rlkd \
  --model Qwen/Qwen3-0.6B \
  --teacher-model Qwen/Qwen3-4B \
  --dataset reasoning.jsonl \
  --output ./output/rlkd

# Anneal alpha from 0.5 down to 0.1 over training (output goes to the default ./output/rlkd)
pmetal rlkd \
  --model Qwen/Qwen3-0.6B \
  --teacher-model Qwen/Qwen3-4B \
  --dataset reasoning.jsonl \
  --distill-alpha 0.5 --final-alpha 0.1 --anneal-alpha
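
To make the annealing example concrete, here is a minimal sketch of how alpha might move from `--distill-alpha` to `--final-alpha`. The shape of the schedule is not documented on this page, so linear interpolation over training steps is an assumption, and `annealed_alpha` is a hypothetical helper rather than a pmetal function.

```python
def annealed_alpha(step: int, total_steps: int,
                   start_alpha: float = 0.5, final_alpha: float = 0.1) -> float:
    """Return alpha at a given step.

    ASSUMPTION: linear interpolation from start_alpha to final_alpha.
    The actual schedule used by the tool may differ.
    """
    if total_steps <= 1:
        return final_alpha
    # Fraction of training completed, clamped to [0, 1].
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start_alpha + (final_alpha - start_alpha) * progress


# Values from the example command: --distill-alpha 0.5 --final-alpha 0.1
total = 101  # hypothetical number of training steps
schedule = [annealed_alpha(s, total) for s in (0, 50, 100)]
print(schedule)
```

Under this assumption the teacher guides the student strongly at the start, and the GRPO reward signal gradually takes over as alpha falls.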

Parameters

Parameter	Default	Description
`--model`	required	Student/policy model ID or local path
`--teacher-model`	required	Teacher model ID or local path
`--dataset`	required	Training dataset (JSONL)
`--output`	`./output/rlkd`	Output directory
`--distill-alpha`	`0.3`	Distillation weight (0 = pure RL, 1 = pure distillation)
`--final-alpha`	`0.05`	Alpha value at the end of the annealing schedule (see `--anneal-alpha`)
`--anneal-alpha`	`false`	Anneal alpha from `--distill-alpha` to `--final-alpha` over training
`--distill-temperature`	`2.0`	Temperature for teacher soft targets
`--num-generations`	`8`	GRPO group size
`--beta`	`0.001`	KL penalty coefficient
`--lora-r`	`16`	LoRA rank
`--lora-alpha`	`32`	LoRA scaling factor
`--learning-rate`	`5e-6`	Learning rate
`--max-seq-len`	`512`	Maximum prompt plus completion length
`--max-completion-length`	`512`	Maximum completion length
`--reasoning-rewards`	`false`	Enable reasoning-aware reward functions
`--log-metrics`	—	Write JSONL metrics
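
The effect of `--distill-temperature` can be illustrated with the standard temperature-scaled softmax, which divides logits by T before normalizing. This is a sketch of the general technique only: whether pmetal computes teacher soft targets exactly this way is an assumption, and the logits below are hypothetical.

```python
import math


def soft_targets(logits: list[float], temperature: float = 2.0) -> list[float]:
    """Temperature-scaled softmax: softmax(logits / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


teacher_logits = [4.0, 2.0, 0.0]  # hypothetical teacher logits for three tokens

sharp = soft_targets(teacher_logits, temperature=1.0)  # plain softmax
soft = soft_targets(teacher_logits, temperature=2.0)   # the default --distill-temperature

print(sharp)
print(soft)
```

A temperature above 1 flattens the distribution, so lower-ranked tokens keep more probability mass and the student sees more of the teacher's relative preferences.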

See Also

Training Overview — All training methods
Knowledge Distillation — Distillation methods
GRPO Training — GRPO without distillation