pmetal pretrain
Run full-parameter pretraining from tokenized binary shards. This is separate from LoRA/QLoRA fine-tuning and is designed for packed text corpora.
pmetal pretrain \
  --arch <ARCH> \
  --shards <SHARD_GLOB_OR_LIST> \
  [OPTIONS]

Example

pmetal tokenize \
  --input corpus.jsonl \
  --output ./shards \
  --tokenizer Qwen/Qwen3-0.6B

pmetal pretrain \
  --arch qwen \
  --shards "./shards/*.bin" \
  --seq-len 2048 \
  --batch-size 4 \
  --steps 10000 \
  --output ./pretrain-output

Parameters
| Parameter | Default | Description |
|---|---|---|
| --arch | required | Model architecture, such as llama, qwen, gemma, mistral, phi, or gpt-oss |
| --shards | required | Glob pattern or comma-separated list of .bin shard files |
| --seq-len | 2048 | Packed sequence length |
| --batch-size | 4 | Batch size |
| --steps | 10000 | Training steps |
| --learning-rate | 3e-4 | Peak learning rate |
| --min-lr | 1e-5 | Cosine floor learning rate |
| --warmup-steps | 1000 | Linear warmup steps |
| --lr-schedule | cosine | constant, linear, or cosine |
| --weight-decay | 0.1 | AdamW weight decay |
| --max-grad-norm | 1.0 | Gradient clipping; 0 disables |
| --eos-token-id | 0 | EOS token inserted between documents |
| --output | ./pretrain-output | Output and checkpoint directory |
| --checkpoint-every | 1000 | Save checkpoint every N steps; 0 disables |
| --resume | — | Resume from checkpoint directory |
| --model-config | — | Model config JSON overriding architecture defaults |
| --z-loss | 0.0 | MoE router z-loss coefficient |
| --gradient-accumulation-steps | 1 | Effective batch multiplier |
| --log-every | 10 | Log every N steps |
| --eval-every | 0 | Evaluate every N steps; 0 disables |
| --eval-batches | 10 | Batches per eval round |
| --seed | 42 | Random seed |
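
To make the defaults in the table concrete, the following shell sketch estimates how many tokens one optimizer step consumes and what learning rate a given step would see. It is an illustration under stated assumptions, not pmetal's actual implementation: it assumes --batch-size counts packed sequences per micro-batch, that --gradient-accumulation-steps multiplies that count, and that the cosine schedule warms up linearly from 0 to --learning-rate over --warmup-steps and then decays to --min-lr at the final step. The function names tokens_per_step and lr_at_step are hypothetical helpers introduced here.

```shell
#!/bin/sh
# Defaults copied from the parameter table above.
SEQ_LEN=2048
BATCH_SIZE=4
GRAD_ACCUM=1
STEPS=10000
PEAK_LR=3e-4
MIN_LR=1e-5
WARMUP=1000

# Tokens consumed per optimizer step.
# ASSUMPTION: --batch-size counts packed sequences per micro-batch and
# gradient accumulation multiplies it.
tokens_per_step() {
  echo $(( SEQ_LEN * BATCH_SIZE * GRAD_ACCUM ))
}

# Learning rate at step $1 under an ASSUMED schedule shape:
# linear warmup from 0 to the peak, then cosine decay to the floor.
lr_at_step() {
  awk -v s="$1" -v peak="$PEAK_LR" -v lo="$MIN_LR" \
      -v warm="$WARMUP" -v total="$STEPS" 'BEGIN {
    pi = atan2(0, -1)
    if (s < warm) {
      lr = peak * s / warm
    } else {
      progress = (s - warm) / (total - warm)
      lr = lo + 0.5 * (peak - lo) * (1 + cos(pi * progress))
    }
    printf "%.6e\n", lr
  }'
}

echo "tokens per step: $(tokens_per_step)"
echo "tokens per run:  $(( $(tokens_per_step) * STEPS ))"
for step in 0 500 1000 5500 10000; do
  echo "step $step: lr=$(lr_at_step "$step")"
done
```

Under these assumptions the defaults work out to 8192 tokens per optimizer step, so raising --gradient-accumulation-steps is one way to grow the effective batch without increasing per-step memory. The exact warmup and decay formulas pmetal uses may differ from this sketch.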
See Also
- pmetal tokenize — Create pretraining shards
- Training Overview — Training methods