# Training Methods
## Supervised Fine-Tuning (SFT)

Standard fine-tuning on instruction/response pairs. Used via `pmetal train` or `easy::finetune()`.
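For orientation, an SFT dataset is a JSONL file with one instruction/response pair per line. The field names below (`prompt`, `response`) are illustrative assumptions, not confirmed by this page:

```json
{"prompt": "Translate to French: Hello", "response": "Bonjour"}
{"prompt": "What is 2 + 2?", "response": "4"}
```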
## LoRA

Low-Rank Adaptation — trains small adapter matrices instead of full weights. Parameters:

- rank (`--lora-r`): Adapter rank (default: 16)
- alpha (`--lora-alpha`): Scaling factor (default: 2× rank)
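As an illustration of how rank and alpha interact (a sketch on plain vectors, not pmetal's actual tensor implementation), the LoRA forward pass is `h = W x + (alpha / r) · B (A x)`:

```rust
/// Dense matrix-vector product over row-major `Vec<Vec<f32>>`.
fn matvec(m: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    m.iter()
        .map(|row| row.iter().zip(x).map(|(w, xi)| w * xi).sum::<f32>())
        .collect()
}

/// LoRA forward pass: frozen weight `w` (d_out x d_in), trainable adapters
/// `a` (r x d_in) and `b` (d_out x r). The update is scaled by alpha / r,
/// so the default alpha = 2 * rank gives a scale of 2.
fn lora_forward(
    w: &[Vec<f32>],
    a: &[Vec<f32>],
    b: &[Vec<f32>],
    alpha: f32,
    x: &[f32],
) -> Vec<f32> {
    let r = a.len() as f32; // adapter rank = number of rows in A
    let base = matvec(w, x);
    let delta = matvec(b, &matvec(a, x));
    base.iter()
        .zip(&delta)
        .map(|(h, d)| h + (alpha / r) * d)
        .collect()
}
```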
## QLoRA

4-bit quantized LoRA. Loads the base model in NF4, FP4, or INT8, and trains adapters in full precision.

```sh
pmetal train --model Qwen/Qwen3-0.6B --dataset train.jsonl --quantization nf4
```

## DoRA

Weight-Decomposed LoRA — decomposes weight updates into magnitude and direction for better training stability.
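The magnitude/direction decomposition can be sketched per weight column as follows (an assumed simplification; pmetal's internals are not shown on this page). The adapted column is `w' = m · v / ‖v‖`, where `v` is the base column plus the low-rank update and `m` is a trainable magnitude:

```rust
/// DoRA reparameterization of one weight column: combine the frozen base
/// column `w0` with the low-rank update `delta`, normalize the result to a
/// unit direction, then rescale by the trainable magnitude `m`
/// (initialized to the norm of `w0`).
fn dora_column(w0: &[f32], delta: &[f32], m: f32) -> Vec<f32> {
    let v: Vec<f32> = w0.iter().zip(delta).map(|(w, d)| w + d).collect();
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    v.iter().map(|x| m * x / norm).collect()
}
```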
```sh
pmetal train --model Qwen/Qwen3-0.6B --dataset train.jsonl --dora
```

## Preference Optimization
### DPO (Direct Preference Optimization)

Trains on preference pairs (chosen/rejected) without a reward model.
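The per-pair objective is `-log σ(β · margin)`, where the margin compares how much more the policy prefers the chosen response over the rejected one, relative to the frozen reference model. A minimal numeric sketch (not pmetal's implementation):

```rust
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// DPO loss for one preference pair, given the summed log-probabilities of
/// the chosen and rejected responses under the policy and the reference
/// model. `beta` controls the strength of the KL-like penalty.
fn dpo_loss(
    pol_chosen: f32,
    pol_rejected: f32,
    ref_chosen: f32,
    ref_rejected: f32,
    beta: f32,
) -> f32 {
    let margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected);
    -(sigmoid(beta * margin)).ln()
}
```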
```rust
easy::dpo("model", "preferences.jsonl")
    .dpo_beta(0.1)
    .reference_model("model")
    .run()
    .await?;
```

### SimPO (Simple Preference Optimization)
Section titled “SimPO (Simple Preference Optimization)”Simplified DPO without a reference model.
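SimPO drops the reference model by using length-normalized log-probabilities as implicit rewards, plus a target margin γ. A sketch of that objective (illustrative; not pmetal's code):

```rust
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// SimPO loss for one pair: rewards are the average per-token
/// log-probabilities scaled by `beta`; no reference model is needed.
/// `gamma` is the target reward margin between chosen and rejected.
fn simpo_loss(
    chosen_logp: f32,
    chosen_len: usize,
    rejected_logp: f32,
    rejected_len: usize,
    beta: f32,
    gamma: f32,
) -> f32 {
    let r_chosen = beta * chosen_logp / chosen_len as f32;
    let r_rejected = beta * rejected_logp / rejected_len as f32;
    -(sigmoid(r_chosen - r_rejected - gamma)).ln()
}
```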
### ORPO (Odds-Ratio Preference Optimization)

Combines SFT and preference optimization in a single stage.
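Concretely, the single-stage loss is the SFT negative log-likelihood of the chosen response plus a weighted odds-ratio penalty on the rejected one. A numeric sketch under that formulation (an assumption about the exact form; not pmetal's code):

```rust
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// Log-odds of a sequence from its length-normalized log-probability:
/// log(p / (1 - p)) with p = exp(avg_logp).
fn log_odds(avg_logp: f32) -> f32 {
    let p = avg_logp.exp();
    (p / (1.0 - p)).ln()
}

/// ORPO: SFT term (NLL of the chosen response) plus a `lambda`-weighted
/// odds-ratio term, with no reference model and no separate stage.
fn orpo_loss(chosen_avg_logp: f32, rejected_avg_logp: f32, lambda: f32) -> f32 {
    let nll = -chosen_avg_logp;
    let or_term =
        -(sigmoid(log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp))).ln();
    nll + lambda * or_term
}
```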
### KTO (Kahneman-Tversky Optimization)

Preference optimization using prospect theory — works with binary feedback (good/bad) instead of pairwise comparisons.
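Because feedback is per-example rather than pairwise, each example is scored on its own against a reference point, with desirable and undesirable examples pushed in opposite directions. A simplified sketch of that idea (the full method estimates the reference point from a KL term; this is an assumption, not pmetal's code):

```rust
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// Simplified per-example KTO value. `log_ratio` is
/// log pi(y|x) - log pi_ref(y|x); `z_ref` is the reference point.
/// Desirable examples are rewarded for exceeding the reference point,
/// undesirable ones for falling below it.
fn kto_loss(log_ratio: f32, desirable: bool, beta: f32, z_ref: f32) -> f32 {
    if desirable {
        1.0 - sigmoid(beta * (log_ratio - z_ref))
    } else {
        1.0 - sigmoid(beta * (z_ref - log_ratio))
    }
}
```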
## Reasoning Training

### GRPO (Group Relative Policy Optimization)

Samples multiple completions per prompt, scores them with reward functions, and optimizes the policy relative to group performance.
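"Relative to group performance" means each completion's advantage is its reward standardized against the other samples for the same prompt, so no learned value model is needed. A minimal sketch of that step (not pmetal's implementation):

```rust
/// Group-relative advantages: standardize each completion's reward against
/// the mean and standard deviation of its own sampling group.
fn group_advantages(rewards: &[f32]) -> Vec<f32> {
    let n = rewards.len() as f32;
    let mean = rewards.iter().sum::<f32>() / n;
    let var = rewards.iter().map(|r| (r - mean).powi(2)).sum::<f32>() / n;
    let std = var.sqrt().max(1e-8); // guard against a zero-variance group
    rewards.iter().map(|r| (r - mean) / std).collect()
}
```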
```sh
pmetal grpo --model Qwen/Qwen3-0.6B --dataset reasoning.jsonl --reasoning-rewards
```

### DAPO (Decoupled Alignment with Policy Optimization)

Decouples the alignment and policy optimization steps for more stable reasoning training.
## ANE Training

Training runs on the Apple Neural Engine when available: forward passes use the ANE, with gradient computation on the CPU. Activated automatically on supported models.
## See Also

- Training Overview — Method availability matrix
- Distillation — Knowledge distillation methods