pmetal grpo
GRPO and DAPO reasoning training with reward functions and sampling.
Train models for reasoning tasks using Group Relative Policy Optimization (GRPO) or Decoupled Alignment with Policy Optimization (DAPO).
pmetal grpo \ --model <MODEL> \ --dataset <DATASET> \ [OPTIONS]Examples
Section titled “Examples”# GRPO with reasoning rewardspmetal grpo \ --model Qwen/Qwen3-0.6B \ --dataset reasoning.jsonl \ --reasoning-rewards
# DAPO variantpmetal grpo \ --model Qwen/Qwen3-0.6B \ --dataset reasoning.jsonl \ --dapo
# With speculative decoding (2-4× faster rollouts)pmetal grpo \ --model Qwen/Qwen3-0.6B \ --dataset reasoning.jsonl \ --speculative --speculative-draft-tokens 3
# VLM mode with image inputspmetal grpo \ --model Qwen/Qwen2-VL-2B \ --dataset vlm_reasoning.jsonl \ --vlm --max-image-size 512
# ML reward model scoringpmetal grpo \ --model Qwen/Qwen3-0.6B \ --dataset reasoning.jsonl \ --reward-model reward-model-path \ --reward-model-weight 0.5 --async-rewardsDataset Format
Section titled “Dataset Format”GRPO expects a reasoning dataset:
{"problem": "What is 15 × 23?", "thinking": "15 × 23 = 15 × 20 + 15 × 3 = 300 + 45 = 345", "solution": "345"}Parameters
Section titled “Parameters”| Parameter | Default | Description |
|---|---|---|
--model | required | Model ID or local path |
--dataset | required | Reasoning dataset (JSONL) |
--dapo | false | Use DAPO variant |
--reasoning-rewards | false | Enable reasoning-aware rewards |
--speculative | false | Speculative decoding for faster rollouts |
--speculative-draft-tokens | 3 | Draft tokens per speculative step |
--vlm | false | Vision-Language Model mode |
--max-image-size | 336 | Max image dimension for VLM |
--reward-model | — | Pretrained reward model path/ID |
--reward-model-max-length | 2048 | Max reward-model input tokens |
--reward-model-weight | 1.0 | Weight for ML reward model scores |
--grpo-kv-bits | — | KV cache quantization bits for rollout generation |
--log-metrics | — | Write JSONL metrics |
--async-rewards | false | Background reward scoring |
Methods
Section titled “Methods”| Method | Description |
|---|---|
| GRPO | Group Relative Policy Optimization — samples multiple completions and optimizes relative to group reward |
| DAPO | Decoupled GRPO — separates alignment and policy optimization for more stable training |
See Also
Section titled “See Also”- Training Methods — All training method details
- pmetal train — SFT/LoRA training