
pmetal rlkd

Reinforcement Learning with Knowledge Distillation (RLKD) combines GRPO policy-gradient optimization with knowledge distillation from a frozen teacher model. The total loss is L = (1 - alpha) * L_grpo + alpha * L_distill, where alpha is the distillation weight set by --distill-alpha.
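As a quick illustration of how the distillation weight trades off the two terms, the blend can be sketched in a few lines of Python. The function name `rlkd_loss` and the loss values are hypothetical and chosen only for this example; how pmetal computes each term internally is not shown here.

```python
def rlkd_loss(l_grpo: float, l_distill: float, alpha: float) -> float:
    """Blend the two losses: L = (1 - alpha) * L_grpo + alpha * L_distill."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * l_grpo + alpha * l_distill


# Made-up loss values, purely for illustration.
l_grpo, l_distill = 2.0, 1.0

for alpha in (0.0, 0.3, 1.0):
    # alpha = 0 keeps only the GRPO term, alpha = 1 keeps only distillation.
    print(alpha, rlkd_loss(l_grpo, l_distill, alpha))
```

With alpha = 0 the result equals the GRPO loss alone, and with alpha = 1 it equals the distillation loss alone, matching the "pure RL" and "pure distillation" endpoints of --distill-alpha.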

pmetal rlkd \
--model <STUDENT_MODEL> \
--teacher-model <TEACHER_MODEL> \
--dataset <DATASET> \
--output <OUTPUT_DIR> \
[OPTIONS]

# Basic RLKD with teacher distillation
pmetal rlkd \
--model Qwen/Qwen3-0.6B \
--teacher-model Qwen/Qwen3-4B \
--dataset reasoning.jsonl \
--output ./output/rlkd

# With annealing alpha (reduce distillation over time)
pmetal rlkd \
--model Qwen/Qwen3-0.6B \
--teacher-model Qwen/Qwen3-4B \
--dataset reasoning.jsonl \
--distill-alpha 0.5 --final-alpha 0.1 --anneal-alpha
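
To give a feel for what annealing from 0.5 to 0.1 means, here is a minimal sketch that assumes a linear schedule over training steps. The shape of the schedule pmetal actually uses is not stated on this page, so the linear interpolation, the function name `annealed_alpha`, and the step count are all assumptions made for illustration.

```python
def annealed_alpha(step: int, total_steps: int,
                   start: float = 0.5, final: float = 0.1) -> float:
    """Linearly interpolate alpha from `start` to `final` (assumed schedule)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Fraction of training completed, clamped to [0, 1].
    progress = min(max(step / total_steps, 0.0), 1.0)
    return (1.0 - progress) * start + progress * final


# Hypothetical 100-step run: alpha starts at 0.5 and ends at 0.1.
for step in (0, 25, 50, 75, 100):
    print(step, round(annealed_alpha(step, 100), 3))
```

Under this assumption the teacher signal dominates early in training and the GRPO reward signal takes over as alpha shrinks.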
| Parameter | Default | Description |
| --- | --- | --- |
| --model | required | Student/policy model ID or local path |
| --teacher-model | required | Teacher model ID or local path |
| --dataset | required | Training dataset (JSONL) |
| --output | ./output/rlkd | Output directory |
| --distill-alpha | 0.3 | Distillation weight (0 = pure RL, 1 = pure distillation) |
| --final-alpha | 0.05 | Final alpha for annealing schedule |
| --anneal-alpha | false | Enable alpha annealing over training |
| --distill-temperature | 2.0 | Temperature for teacher soft targets |
| --num-generations | 8 | GRPO group size |
| --beta | 0.001 | KL penalty coefficient |
| --lora-r | 16 | LoRA rank |
| --lora-alpha | 32 | LoRA scaling factor |
| --learning-rate | 5e-6 | Learning rate |
| --max-seq-len | 512 | Maximum prompt plus completion length |
| --max-completion-length | 512 | Maximum completion length |
| --reasoning-rewards | false | Enable reasoning-aware reward functions |
| --log-metrics | | Write JSONL metrics |
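
The effect of --distill-temperature can be illustrated with the standard temperature-scaled softmax used for soft targets in knowledge distillation. This is a generic sketch, not pmetal's implementation: the function name and the logits are hypothetical, and whether pmetal also applies the temperature to the student or rescales the loss is not stated here.

```python
import math


def softmax_with_temperature(logits, temperature=2.0):
    """Softmax over logits / temperature; higher temperature gives a flatter distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up teacher logits for three candidate tokens.
teacher_logits = [4.0, 2.0, 0.0]

sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=2.0)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in soft])
```

At temperature 2.0 the top token receives less probability mass and the lower-ranked tokens receive more than at temperature 1.0, which exposes more of the teacher's relative preferences to the student.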