# pmetal grpo
Train models for reasoning tasks using Group Relative Policy Optimization (GRPO) or Decoupled Alignment with Policy Optimization (DAPO).
```shell
pmetal grpo \
  --model <MODEL> \
  --dataset <DATASET> \
  [OPTIONS]
```
## Examples

```shell
# GRPO with reasoning rewards
pmetal grpo \
  --model Qwen/Qwen3-0.6B \
  --dataset reasoning.jsonl \
  --reasoning-rewards
```
```shell
# DAPO variant
pmetal grpo \
  --model Qwen/Qwen3-0.6B \
  --dataset reasoning.jsonl \
  --dapo
```
## Dataset Format

GRPO expects a reasoning dataset in JSON Lines format, one record per line:
```json
{"problem": "What is 15 × 23?", "thinking": "15 × 23 = 15 × 20 + 15 × 3 = 300 + 45 = 345", "solution": "345"}
```
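A small script can help prepare and validate a dataset in this format before training. This is an illustrative sketch, not part of pmetal; the required field names (`problem`, `thinking`, `solution`) are taken from the example above.

```python
import json

# Field names from the dataset format shown above.
REQUIRED_KEYS = {"problem", "thinking", "solution"}

def write_reasoning_dataset(records, path):
    """Write records as JSON Lines, validating each one first."""
    with open(path, "w", encoding="utf-8") as f:
        for i, rec in enumerate(records):
            missing = REQUIRED_KEYS - rec.keys()
            if missing:
                raise ValueError(f"record {i} missing keys: {sorted(missing)}")
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

records = [
    {
        "problem": "What is 15 × 23?",
        "thinking": "15 × 23 = 15 × 20 + 15 × 3 = 300 + 45 = 345",
        "solution": "345",
    }
]
write_reasoning_dataset(records, "reasoning.jsonl")
```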
## Methods

| Method | Description |
|---|---|
| GRPO | Group Relative Policy Optimization — samples multiple completions and optimizes relative to group reward |
| DAPO | Decoupled GRPO — separates alignment and policy optimization for more stable training |
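The core idea behind GRPO's group-relative optimization can be sketched in a few lines: for each prompt, several completions are sampled, and each completion's advantage is its reward normalized against the group's mean and standard deviation. This is a minimal illustration of the advantage computation only, not pmetal's implementation.

```python
import math

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each completion's reward
    by the mean and standard deviation of its sampling group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    if std == 0:
        # All completions scored the same: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# e.g. 4 sampled completions for one prompt, binary verifier rewards
rewards = [1.0, 0.0, 1.0, 0.0]
adv = group_relative_advantages(rewards)
# → [1.0, -1.0, 1.0, -1.0]: above-mean completions are reinforced
```

Because advantages are computed relative to the group, no separate value network is needed, which is what makes the method comparatively lightweight.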
## See Also
- Training Methods — All training method details
- pmetal train — SFT/LoRA training