# Training Overview
PMetal supports 12+ training methods across CLI, GUI, TUI, and the Rust/Python SDK. All methods support callback-based cancellation, JSONL metrics logging, and adaptive learning rate control.
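This page does not show the callback or metrics interfaces themselves, so as a rough illustration of what JSONL metrics logging means (one JSON object per training step, one per line), here is a minimal self-contained sketch. The `JsonlMetricsLogger` class and its `on_step` signature are hypothetical, not PMetal's actual API:

```python
import io
import json

class JsonlMetricsLogger:
    """Minimal JSONL metrics logger: one JSON object per step, one per line.

    Illustrative only; the callback name and signature here are
    hypothetical, not PMetal's real interface.
    """

    def __init__(self, stream):
        self.stream = stream

    def on_step(self, step, metrics):
        # Merge the step index into the metrics dict and append one line.
        record = {"step": step, **metrics}
        self.stream.write(json.dumps(record) + "\n")

buf = io.StringIO()
logger = JsonlMetricsLogger(buf)
logger.on_step(1, {"loss": 2.31, "lr": 1e-4})
logger.on_step(2, {"loss": 2.07, "lr": 1e-4})
lines = buf.getvalue().splitlines()
```

A line-per-record format like this is convenient because a partially written run is still parseable up to the last complete line.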
## Method Matrix

| Method | CLI | GUI | TUI | Library |
|---|---|---|---|---|
| SFT (Supervised Fine-Tuning) | `train` | Yes | Yes | `easy::finetune()` |
| LoRA | `train` | Yes | Yes | `easy::finetune()` |
| QLoRA (4-bit) | `train --quantization nf4` | Yes | Yes | `easy::finetune()` |
| DoRA | `train --dora` | Yes | Yes | `easy::finetune()` |
| DPO (Direct Preference) | — | — | — | `easy::dpo()` |
| SimPO | — | — | — | `easy::simpo()` |
| ORPO | — | — | — | `easy::orpo()` |
| KTO | — | — | — | `easy::kto()` |
| GRPO (Reasoning) | `grpo` | Yes | Yes | `GrpoTrainer` |
| DAPO (Decoupled GRPO) | `grpo --dapo` | Yes | Yes | `DapoTrainer` |
| Knowledge Distillation | `distill` | Yes | Yes | `Distiller` |
| TAID | — | — | — | `TaidDistiller` |
| ANE Training | `train` (auto) | — | Yes | `AneTrainingLoop` |
Library-only methods: GSPO, PPO, Online DPO, Diffusion Training.
## Training Infrastructure

### Sequence Packing

Packs multiple sequences into a single batch for 2–5× higher throughput. Enabled by default, with attention masking that prevents sequences packed into the same batch row from attending to one another.
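To make the throughput claim concrete, here is a greedy first-fit packing sketch. It only models which sequences share a batch row; PMetal's actual packer and its masking details are not shown on this page:

```python
def pack_sequences(lengths, max_len):
    """Greedy first-fit packing: place each sequence (longest first) into
    the first bin with room, so short sequences share a row instead of
    each occupying a mostly-padded row of their own."""
    bins = []  # each bin is a list of sequence lengths sharing one row
    for n in sorted(lengths, reverse=True):
        for b in bins:
            if sum(b) + n <= max_len:
                b.append(n)
                break
        else:
            bins.append([n])
    return bins

# Six sequences that would otherwise need six padded rows of length 512:
packed = pack_sequences([500, 120, 380, 90, 60, 300], max_len=512)
```

Fewer rows per batch means less padding compute, which is where the speedup comes from; a block-diagonal attention mask then keeps the packed sequences independent.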
### Gradient Checkpointing

Trades compute for memory on large models by recomputing activations during the backward pass instead of storing them all. Layer grouping is configurable (default: 4 layers per checkpoint block).
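As a sketch of what the layer grouping means, the helper below partitions layers into checkpoint blocks; only each block's input activation would be stored, with the interior recomputed on the backward pass. The function name is hypothetical:

```python
def checkpoint_blocks(num_layers, layers_per_block=4):
    """Group transformer layers into checkpoint blocks. Only the input
    activation of each block is saved; activations inside a block are
    recomputed during the backward pass."""
    return [list(range(i, min(i + layers_per_block, num_layers)))
            for i in range(0, num_layers, layers_per_block)]

blocks = checkpoint_blocks(32)  # default grouping: 4 layers per block
saved_activations = len(blocks)  # 8 block inputs stored instead of 32
```

Larger blocks save more memory but recompute more layers per backward step, which is the knob the default of 4 balances.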
### Adaptive Learning Rate

EMA-based loss anomaly detection with automatic spike recovery, plateau-triggered LR reduction, and divergence detection.
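A minimal sketch of the EMA-based spike detection idea, assuming nothing about PMetal's actual thresholds or class names (both are made up here):

```python
class EmaLossMonitor:
    """Flag a loss spike when the current loss exceeds a multiple of an
    exponential moving average of past losses. Thresholds are arbitrary
    illustrative values, not PMetal's defaults."""

    def __init__(self, beta=0.98, spike_ratio=2.0):
        self.beta = beta
        self.spike_ratio = spike_ratio
        self.ema = None

    def update(self, loss):
        if self.ema is None:
            self.ema = loss
            return "ok"
        event = "spike" if loss > self.spike_ratio * self.ema else "ok"
        # Fold only non-anomalous losses into the EMA, so a single
        # spike does not inflate the baseline it is compared against.
        if event == "ok":
            self.ema = self.beta * self.ema + (1 - self.beta) * loss
        return event

mon = EmaLossMonitor()
events = [mon.update(x) for x in [2.0, 1.9, 1.8, 9.5, 1.7]]
```

On a spike, a real loop would typically skip the step or roll back and reduce the LR; on a long plateau it would reduce the LR instead.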
## Optimizers

| Optimizer | Description |
|---|---|
| AdamW | Standard AdamW with configurable weight decay |
| Metal Fused AdamW | GPU-accelerated fused parameter updates |
| Schedule-Free | Memory-efficient; no LR schedule needed |
| 8-bit Adam | Memory-efficient optimizer state for large models |
| LoRA+ | Differentiated learning rates for the LoRA A and B matrices |
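To illustrate what "differentiated LR for A and B matrices" means in LoRA+, here is a sketch of splitting parameters into two LR groups, with the B matrices getting a higher rate. The parameter naming convention and ratio are assumptions, not PMetal's:

```python
def lora_plus_groups(param_names, base_lr=1e-4, b_lr_ratio=16.0):
    """Split LoRA parameters into two learning-rate groups (the LoRA+
    recipe): B matrices train at a multiple of the A-matrix rate.
    The `lora_A`/`lora_B` suffix convention here is hypothetical."""
    a = [n for n in param_names if n.endswith("lora_A")]
    b = [n for n in param_names if n.endswith("lora_B")]
    return [
        {"names": a, "lr": base_lr},
        {"names": b, "lr": base_lr * b_lr_ratio},
    ]

groups = lora_plus_groups(["q_proj.lora_A", "q_proj.lora_B"])
```

The intuition behind LoRA+ is that B is initialized to zero, so it benefits from a larger step size than A.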
## LR Schedules

`constant`, `linear`, `cosine`, `cosine_with_restarts`, `polynomial`, `wsd` (warmup-stable-decay)
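Two of these schedules sketched as plain functions, to show their shapes; these are standard textbook formulas, not PMetal's implementation:

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    t = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))

def wsd_lr(step, warmup, stable, decay, base_lr):
    """Warmup-Stable-Decay: linear warmup, flat plateau, linear decay."""
    if step < warmup:
        return base_lr * step / warmup
    if step < warmup + stable:
        return base_lr
    t = (step - warmup - stable) / decay
    return base_lr * max(0.0, 1.0 - t)
```

`wsd` is convenient for continued training: the flat plateau can be extended without replanning the whole schedule, with only the short decay phase rerun at the end.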
## Additional Features

- NEFTune — Adds noise to embeddings during fine-tuning to improve generation quality
- Checkpoint Management — Save/resume with best-loss rollback
- Tool/Function Calling — Chat templates with native tool definitions
- Distributed Training — mDNS auto-discovery, Ring All-Reduce
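To unpack the Ring All-Reduce bullet: each of the N nodes holds N gradient chunks; a reduce-scatter phase sums each chunk around the ring, then an all-gather phase circulates the summed chunks back. The simulation below is a self-contained sketch of the algorithm, not PMetal's networking code:

```python
def ring_all_reduce(node_chunks):
    """Simulate ring all-reduce over n nodes, each holding n chunks.

    Reduce-scatter: for n-1 steps every node sends one chunk to its
    right neighbor, which adds it in; afterwards each node owns one
    fully summed chunk. All-gather: n-1 more steps circulate those
    summed chunks until every node holds the complete result.
    """
    n = len(node_chunks)
    data = [list(chunks) for chunks in node_chunks]
    for step in range(n - 1):  # reduce-scatter phase
        # Snapshot sends first so all transfers in a step are simultaneous.
        sends = [(i, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for i, c, val in sends:
            data[(i + 1) % n][c] += val
    for step in range(n - 1):  # all-gather phase
        sends = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for i, c, val in sends:
            data[(i + 1) % n][c] = val
    return data

reduced = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
```

Each node sends and receives only one chunk per step, so per-node bandwidth stays constant as the ring grows, which is why the ring topology is standard for gradient synchronization.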
## Dataset Formats

Formats are auto-detected from the data structure:

| Format | Structure |
|---|---|
| ShareGPT | `{"conversations": [{"from": "human", "value": "..."}]}` |
| Alpaca | `{"instruction": "...", "input": "...", "output": "..."}` |
| OpenAI/Messages | `{"messages": [{"role": "user", "content": "..."}]}` |
| Reasoning | `{"problem": "...", "thinking": "...", "solution": "..."}` |
| Simple | `{"text": "..."}` |
| Parquet | Standard text columns or reasoning formats |
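Auto-detection of these JSON formats can be done by inspecting the keys of a sample record. The heuristic below mirrors the table above but is only an illustration; PMetal's real detector is not shown on this page:

```python
def detect_format(example):
    """Guess the dataset format from one example's keys (illustrative)."""
    if "conversations" in example:
        return "sharegpt"
    if "messages" in example:
        return "openai"
    if "instruction" in example and "output" in example:
        return "alpaca"
    if {"problem", "solution"} <= example.keys():
        return "reasoning"
    if "text" in example:
        return "simple"
    raise ValueError("unrecognized dataset format")

fmt = detect_format({"messages": [{"role": "user", "content": "hi"}]})
```

Checking the most distinctive keys first (`conversations`, `messages`) avoids misclassifying records that also happen to contain a generic `text` field.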
## See Also

- Training Methods — Detailed method descriptions
- Distillation — Knowledge distillation deep dive
- `pmetal train` — CLI training parameters