
pmetal grpo

GRPO and DAPO reasoning training with reward functions and sampling.

Train models for reasoning tasks using Group Relative Policy Optimization (GRPO) or Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO).

pmetal grpo \
  --model <MODEL> \
  --dataset <DATASET> \
  [OPTIONS]

# GRPO with reasoning rewards
pmetal grpo \
  --model Qwen/Qwen3-0.6B \
  --dataset reasoning.jsonl \
  --reasoning-rewards

# DAPO variant
pmetal grpo \
  --model Qwen/Qwen3-0.6B \
  --dataset reasoning.jsonl \
  --dapo

# With speculative decoding (2-4× faster rollouts)
pmetal grpo \
  --model Qwen/Qwen3-0.6B \
  --dataset reasoning.jsonl \
  --speculative --speculative-draft-tokens 3

# VLM mode with image inputs
pmetal grpo \
  --model Qwen/Qwen2-VL-2B \
  --dataset vlm_reasoning.jsonl \
  --vlm --max-image-size 512

# ML reward model scoring
pmetal grpo \
  --model Qwen/Qwen3-0.6B \
  --dataset reasoning.jsonl \
  --reward-model reward-model-path \
  --reward-model-weight 0.5 --async-rewards

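GRPO expects a reasoning dataset in JSONL format: one JSON object per line, with problem, thinking, and solution fields, as in this example: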

{"problem": "What is 15 × 23?", "thinking": "15 × 23 = 15 × 20 + 15 × 3 = 300 + 45 = 345", "solution": "345"}
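Before starting a training run it can help to check that every line of the dataset has the expected shape. The following is a minimal Python sketch, not part of pmetal. It assumes that all three fields shown in the example record are required and are non-empty strings. That requirement is inferred from the single example, so the actual loader may be more permissive. The names `REQUIRED_FIELDS`, `validate_record`, and `to_jsonl` are hypothetical.

```python
import json

# Assumption: these three fields are inferred from the example record above;
# pmetal itself may accept additional or optional fields.
REQUIRED_FIELDS = ("problem", "thinking", "solution")


def validate_record(line: str) -> dict:
    """Parse one JSONL line and check that each expected field is a non-empty string."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("each line must be a JSON object")
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty field: {field!r}")
    return record


def to_jsonl(records) -> str:
    """Serialize records one per line, keeping non-ASCII characters such as '×' readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"


sample = {
    "problem": "What is 15 × 23?",
    "thinking": "15 × 23 = 15 × 20 + 15 × 3 = 300 + 45 = 345",
    "solution": "345",
}

# Round trip: write the sample as JSONL, then validate every line.
text = to_jsonl([sample])
parsed = [validate_record(line) for line in text.splitlines()]
```

To check a real file, the same `validate_record` call can be applied to each non-blank line read from disk; a `ValueError` then points at the first malformed record.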
| Parameter | Default | Description |
|---|---|---|
| --model | required | Model ID or local path |
| --dataset | required | Reasoning dataset (JSONL) |
| --dapo | false | Use DAPO variant |
| --reasoning-rewards | false | Enable reasoning-aware rewards |
| --speculative | false | Speculative decoding for faster rollouts |
| --speculative-draft-tokens | 3 | Draft tokens per speculative step |
| --vlm | false | Vision-Language Model mode |
| --max-image-size | 336 | Max image dimension for VLM |
| --reward-model | | Pretrained reward model path/ID |
| --reward-model-max-length | 2048 | Max reward-model input tokens |
| --reward-model-weight | 1.0 | Weight for ML reward model scores |
| --grpo-kv-bits | | KV cache quantization bits for rollout generation |
| --log-metrics | | Write JSONL metrics |
| --async-rewards | false | Background reward scoring |
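When runs are launched from a script, the command line can be assembled programmatically. The sketch below is a hypothetical Python helper, `build_grpo_command`, that uses only flags listed in the table above and covers a subset of them. It builds the argument list and does not run anything, so the argument names on the Python side are illustrative choices rather than a pmetal API.

```python
import shlex


def build_grpo_command(model, dataset, *, dapo=False, reasoning_rewards=False,
                       speculative=False, draft_tokens=None,
                       reward_model=None, reward_model_weight=None,
                       async_rewards=False):
    """Assemble an argv list for `pmetal grpo` from the documented flags."""
    argv = ["pmetal", "grpo", "--model", model, "--dataset", dataset]
    if dapo:
        argv.append("--dapo")
    if reasoning_rewards:
        argv.append("--reasoning-rewards")
    if speculative:
        argv.append("--speculative")
        if draft_tokens is not None:
            argv += ["--speculative-draft-tokens", str(draft_tokens)]
    if reward_model is not None:
        argv += ["--reward-model", reward_model]
        if reward_model_weight is not None:
            argv += ["--reward-model-weight", str(reward_model_weight)]
    if async_rewards:
        argv.append("--async-rewards")
    return argv


cmd = build_grpo_command("Qwen/Qwen3-0.6B", "reasoning.jsonl",
                         reasoning_rewards=True)
# shlex.join quotes arguments where needed, giving a copy-pasteable string.
printable = shlex.join(cmd)
print(printable)
```

Passing the list to `subprocess.run(cmd, check=True)` would launch the run, assuming `pmetal` is installed and on `PATH`. Keeping the command as a list avoids shell-quoting problems with paths that contain spaces.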
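| Method | Description |
|---|---|
| GRPO | Group Relative Policy Optimization: samples multiple completions per prompt and optimizes each one relative to its group's rewards |
| DAPO | Decoupled Clip and Dynamic Sampling Policy Optimization: a GRPO variant that decouples the clipping bounds and adds dynamic sampling for more stable training |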
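The "relative to group reward" idea in the GRPO description can be illustrated with a short Python sketch. It shows the commonly described formulation, in which each completion's reward is normalized by the mean and standard deviation of its group. It is not taken from pmetal's source: the function name `group_relative_advantages`, the `eps` value, and the use of the population standard deviation are assumptions, and implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other completions for the same prompt.

    Assumption: population standard deviation plus a small eps for stability;
    real implementations may use the sample standard deviation or other scaling.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions sampled for one prompt: two scored correct, two incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every completion gets the same reward carries no signal:
# every advantage is zero.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Completions that beat their group's average get a positive advantage and the rest get a negative one, so no separate value network is needed to provide a baseline.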