pmetal grpo

Train models for reasoning tasks using Group Relative Policy Optimization (GRPO) or Decoupled Alignment with Policy Optimization (DAPO).

pmetal grpo \
  --model <MODEL> \
  --dataset <DATASET> \
  [OPTIONS]
# GRPO with reasoning rewards
pmetal grpo \
  --model Qwen/Qwen3-0.6B \
  --dataset reasoning.jsonl \
  --reasoning-rewards

# DAPO variant
pmetal grpo \
  --model Qwen/Qwen3-0.6B \
  --dataset reasoning.jsonl \
  --dapo

GRPO expects a reasoning dataset in JSONL format, one example per line:

{"problem": "What is 15 × 23?", "thinking": "15 × 23 = 15 × 20 + 15 × 3 = 300 + 45 = 345", "solution": "345"}
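As a minimal sketch, a dataset in this format can be produced with Python's standard `json` module. The example rows below are illustrative, not part of pmetal's shipped data:

```python
import json

# Hypothetical reasoning examples using the problem/thinking/solution keys
# that pmetal grpo expects.
examples = [
    {"problem": "What is 15 × 23?",
     "thinking": "15 × 23 = 15 × 20 + 15 × 3 = 300 + 45 = 345",
     "solution": "345"},
    {"problem": "What is 7 × 12?",
     "thinking": "7 × 12 = 7 × 10 + 7 × 2 = 70 + 14 = 84",
     "solution": "84"},
]

# Write one JSON object per line (JSONL).
with open("reasoning.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```

The resulting file can be passed directly via `--dataset reasoning.jsonl`.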
| Method | Description |
| --- | --- |
| GRPO | Group Relative Policy Optimization: samples multiple completions per prompt and optimizes each relative to the group's reward |
| DAPO | Decoupled GRPO: separates alignment and policy optimization for more stable training |
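The core idea behind GRPO's "relative to group reward" step can be sketched as follows. This is an illustrative reimplementation of group-relative advantage normalization, not pmetal's actual code:

```python
# Sketch of GRPO's group-relative advantage: for each prompt, several
# completions are sampled and scored, and each completion's advantage is
# its reward normalized against the group's mean and standard deviation.

def group_relative_advantages(rewards, eps=1e-6):
    """rewards: scalar rewards for one group of sampled completions."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    # eps keeps the division stable when all rewards in the group are equal.
    return [(r - mean) / (std + eps) for r in rewards]

# Completions that beat the group average get positive advantages,
# those below it get negative ones.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

No learned value model is needed: the group itself serves as the baseline, which is what distinguishes GRPO from classic PPO-style policy optimization.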