pmetal rlkd
Reinforcement Learning with Knowledge Distillation — combines GRPO with teacher distillation.
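RLKD optimizes the student policy with GRPO policy gradients while distilling knowledge from a frozen teacher model. The combined loss is L = (1 - alpha) * L_grpo + alpha * L_distill, where alpha is the distillation weight set by --distill-alpha: 0 is pure RL and 1 is pure distillation.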
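As a minimal sketch of the weighting above (not pmetal's implementation), the following Python function mixes two already-computed scalar losses. The function name `rlkd_loss` and its argument names are hypothetical, chosen only for illustration.

```python
def rlkd_loss(l_grpo: float, l_distill: float, alpha: float) -> float:
    """Mix a GRPO loss and a distillation loss.

    Implements L = (1 - alpha) * L_grpo + alpha * L_distill.
    Hypothetical helper for illustration; not part of pmetal.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * l_grpo + alpha * l_distill


# alpha = 0 keeps only the RL term; alpha = 1 keeps only the distillation term.
print(rlkd_loss(2.0, 4.0, 0.0))  # 2.0
print(rlkd_loss(2.0, 4.0, 1.0))  # 4.0
print(rlkd_loss(2.0, 4.0, 0.5))  # 3.0
```

Intermediate values of alpha trade the reward signal against imitation of the teacher; the example values 2.0 and 4.0 are arbitrary.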

    pmetal rlkd \
      --model <STUDENT_MODEL> \
      --teacher-model <TEACHER_MODEL> \
      --dataset <DATASET> \
      --output <OUTPUT_DIR> \
      [OPTIONS]

Examples

    # Basic RLKD with teacher distillation
    pmetal rlkd \
      --model Qwen/Qwen3-0.6B \
      --teacher-model Qwen/Qwen3-4B \
      --dataset reasoning.jsonl \
      --output ./output/rlkd

    # With annealing alpha (reduce distillation over time)
    pmetal rlkd \
      --model Qwen/Qwen3-0.6B \
      --teacher-model Qwen/Qwen3-4B \
      --dataset reasoning.jsonl \
      --distill-alpha 0.5 --final-alpha 0.1 --anneal-alpha

Parameters

| Parameter | Default | Description |
|---|---|---|
| --model | required | Student/policy model ID or local path |
| --teacher-model | required | Teacher model ID or local path |
| --dataset | required | Training dataset (JSONL) |
| --output | ./output/rlkd | Output directory |
| --distill-alpha | 0.3 | Distillation weight (0 = pure RL, 1 = pure distillation) |
| --final-alpha | 0.05 | Final alpha for annealing schedule |
| --anneal-alpha | false | Enable alpha annealing over training |
| --distill-temperature | 2.0 | Temperature for teacher soft targets |
| --num-generations | 8 | GRPO group size |
| --beta | 0.001 | KL penalty coefficient |
| --lora-r | 16 | LoRA rank |
| --lora-alpha | 32 | LoRA scaling factor |
| --learning-rate | 5e-6 | Learning rate |
| --max-seq-len | 512 | Maximum prompt plus completion length |
| --max-completion-length | 512 | Maximum completion length |
| --reasoning-rewards | false | Enable reasoning-aware reward functions |
| --log-metrics | - | Write JSONL metrics |
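
The table does not say which schedule `--anneal-alpha` follows, nor exactly how `--distill-temperature` enters the loss. The sketch below assumes a linear interpolation from `--distill-alpha` to `--final-alpha` and the common convention of dividing logits by the temperature before the softmax. The names `annealed_alpha` and `soft_targets` are hypothetical, and pmetal's actual behavior may differ.

```python
import math


def annealed_alpha(step: int, total_steps: int,
                   start: float = 0.3, final: float = 0.05) -> float:
    """Linearly interpolate alpha from `start` (step 0) to `final` (last step).

    Assumption: a linear schedule. The defaults mirror --distill-alpha and
    --final-alpha from the table.
    """
    if total_steps < 2:
        raise ValueError("total_steps must be at least 2")
    # Clamp so out-of-range steps never leave the [start, final] interval.
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + (final - start) * frac


def soft_targets(logits: list[float], temperature: float = 2.0) -> list[float]:
    """Temperature-scaled softmax: a higher temperature flattens the distribution."""
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Alpha at the start, middle, and end of a 101-step run.
for step in (0, 50, 100):
    print(step, round(annealed_alpha(step, total_steps=101), 4))

# The same teacher logits at temperature 1.0 and at the default 2.0.
print(soft_targets([2.0, 1.0, 0.0], temperature=1.0))
print(soft_targets([2.0, 1.0, 0.0], temperature=2.0))
```

Under these assumptions, annealing shifts weight from the distillation term toward the GRPO term as training proceeds, and a temperature above 1 spreads teacher probability mass across more tokens.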
See Also
- Training Overview: all training methods
- Knowledge Distillation: distillation methods
- GRPO Training: GRPO without distillation