
Training Methods

Supervised Fine-Tuning (SFT)

Standard supervised fine-tuning on instruction/response pairs. Invoked via pmetal train or easy::finetune().

LoRA (Low-Rank Adaptation)

Trains small low-rank adapter matrices instead of the full weight matrices. Parameters:

  • rank (--lora-r): Adapter rank (default: 16)
  • alpha (--lora-alpha): Scaling factor (default: 2× rank)
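As a rough sketch of what the adapters do (shapes and scaling follow the standard LoRA formulation; this is illustrative, not pmetal's internal code):

```rust
/// Effective LoRA weight for one layer: W' = W + (alpha / r) * B * A,
/// where B is d_out x r and A is r x d_in. Only B and A are trained;
/// the base weight W stays frozen.
fn lora_weight(
    w: &[Vec<f64>], // frozen base weight, d_out x d_in
    b: &[Vec<f64>], // adapter B, d_out x r (initialized to zeros)
    a: &[Vec<f64>], // adapter A, r x d_in (small random init)
    alpha: f64,
    rank: usize,
) -> Vec<Vec<f64>> {
    let scale = alpha / rank as f64;
    w.iter()
        .enumerate()
        .map(|(i, row)| {
            row.iter()
                .enumerate()
                .map(|(j, &wij)| {
                    // (B * A)[i][j] = sum_k B[i][k] * A[k][j]
                    let delta: f64 = (0..rank).map(|k| b[i][k] * a[k][j]).sum();
                    wij + scale * delta
                })
                .collect()
        })
        .collect()
}

/// Trainable parameters for the LoRA adapters of one layer.
fn lora_params(d_out: usize, d_in: usize, rank: usize) -> usize {
    rank * (d_out + d_in)
}
```

With the defaults above (rank 16, alpha 32), a 4096×4096 projection drops from roughly 16.8M trainable weights to 131,072.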

QLoRA

4-bit quantized LoRA. Loads the base model in NF4, FP4, or INT8 precision and trains the adapters in full precision.

pmetal train --model Qwen/Qwen3-0.6B --dataset train.jsonl --quantization nf4
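Conceptually, 4-bit loading stores each block of weights as indices into a small codebook plus a per-block scale. A toy sketch (the real NF4 codebook holds quantiles of a standard normal distribution; the codebook values and block handling here are illustrative only):

```rust
/// Illustrative NF4-style block dequantization: each stored weight is
/// a 4-bit index into a fixed 16-entry codebook, rescaled by the
/// block's absolute maximum. Only the adapters see full-precision
/// gradients; the base weights stay in this compressed form.
fn dequantize_block(indices: &[u8], absmax: f64, codebook: &[f64; 16]) -> Vec<f64> {
    indices
        .iter()
        .map(|&i| codebook[i as usize] * absmax)
        .collect()
}
```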

DoRA (Weight-Decomposed Low-Rank Adaptation)

Decomposes each weight update into a magnitude and a direction component for better training stability.

pmetal train --model Qwen/Qwen3-0.6B --dataset train.jsonl --dora
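The decomposition can be sketched per weight column (this follows the standard DoRA formulation, not pmetal internals):

```rust
/// DoRA-style reparameterization of one weight column: the merged
/// direction v = w + delta (delta being the low-rank update) is
/// normalized, and a learned scalar magnitude m rescales it, so
/// w' = m * v / ||v||. Magnitude and direction are optimized
/// separately, which is the source of the stability gain.
fn dora_column(w: &[f64], delta: &[f64], magnitude: f64) -> Vec<f64> {
    let v: Vec<f64> = w.iter().zip(delta).map(|(x, d)| x + d).collect();
    let norm = v.iter().map(|x| x * x).sum::<f64>().sqrt().max(1e-12);
    v.iter().map(|x| magnitude * x / norm).collect()
}
```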

DPO (Direct Preference Optimization)

Trains on preference pairs (chosen/rejected) without a separate reward model.

easy::dpo("model", "preferences.jsonl")
    .dpo_beta(0.1)
    .reference_model("model")
    .run()
    .await?;
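Under the hood, the DPO objective for one pair reduces to a logistic loss on log-probability margins against the frozen reference model; a self-contained sketch (function name and scalar log-probs are illustrative):

```rust
/// DPO loss for one preference pair, given the summed token log-probs
/// of the chosen and rejected responses under the policy and the
/// frozen reference model:
///   -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))
fn dpo_loss(
    policy_chosen: f64,
    policy_rejected: f64,
    ref_chosen: f64,
    ref_rejected: f64,
    beta: f64,
) -> f64 {
    let logits = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected));
    // -log(sigmoid(x)) written as ln(1 + e^{-x}) for numerical stability
    (1.0 + (-logits).exp()).ln()
}
```

When the policy matches the reference, the loss sits at ln 2; it falls toward zero as the policy's margin for the chosen response grows, with beta controlling how hard the pair pulls the policy away from the reference.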

Simplified DPO without a reference model.

Combines SFT and preference optimization in a single stage.

KTO (Kahneman–Tversky Optimization)

Preference optimization based on prospect theory; works with binary feedback (good/bad) instead of pairwise comparisons.
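A much-simplified sketch of the idea (this follows the KTO formulation in spirit but omits the KL reference point the full method uses; all names here are illustrative):

```rust
/// Simplified KTO-style per-example loss. Each example carries a
/// binary label rather than a pairwise comparison: the implied reward
/// r = beta * (policy logp - reference logp) is pushed up for
/// desirable examples and down for undesirable ones via
/// 1 - sigmoid(signed reward).
fn kto_loss(policy_logp: f64, ref_logp: f64, desirable: bool, beta: f64) -> f64 {
    let r = beta * (policy_logp - ref_logp);
    let signed = if desirable { r } else { -r };
    1.0 - 1.0 / (1.0 + (-signed).exp())
}
```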

GRPO (Group Relative Policy Optimization)

Samples multiple completions per prompt, scores them with reward functions, and optimizes the policy relative to the group's performance.

pmetal grpo --model Qwen/Qwen3-0.6B --dataset reasoning.jsonl --reasoning-rewards
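The group-relative scoring can be sketched as follows (standard GRPO advantage normalization; not pmetal's code):

```rust
/// Group-relative advantages: sample several completions for one
/// prompt, score each with the reward functions, then normalize the
/// rewards within the group. Each completion's advantage is how far
/// its reward sits above or below the group mean, in units of the
/// group's standard deviation - no learned value function is needed.
fn group_advantages(rewards: &[f64]) -> Vec<f64> {
    let n = rewards.len() as f64;
    let mean = rewards.iter().sum::<f64>() / n;
    let std = (rewards.iter().map(|r| (r - mean).powi(2)).sum::<f64>() / n)
        .sqrt()
        .max(1e-8);
    rewards.iter().map(|r| (r - mean) / std).collect()
}
```

Completions above the group mean get a positive advantage (their tokens are reinforced); those below get a negative one, so the policy improves relative to its own current samples.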

DAPO (Decoupled Alignment with Policy Optimization)


Decouples the alignment and policy optimization steps for more stable reasoning training.

Training uses the Apple Neural Engine (ANE) when available: forward passes run on the ANE, with gradient computation on the CPU. Activated automatically on supported models.