pmetal pretrain

Run full-parameter pretraining from tokenized binary shards. This is separate from LoRA/QLoRA fine-tuning and is designed for packed text corpora.

Usage

pmetal pretrain \
  --arch <ARCH> \
  --shards <SHARD_GLOB_OR_LIST> \
  [OPTIONS]

Example

pmetal tokenize \
  --input corpus.jsonl \
  --output ./shards \
  --tokenizer Qwen/Qwen3-0.6B

pmetal pretrain \
  --arch qwen \
  --shards "./shards/*.bin" \
  --seq-len 2048 \
  --batch-size 4 \
  --steps 10000 \
  --output ./pretrain-output
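
As a rough sanity check on the example above, the token budget of a run can be estimated from the flags alone. The sketch below assumes that each optimizer step consumes `batch-size × gradient-accumulation-steps` packed sequences of `seq-len` tokens, and that `--steps` counts optimizer steps. Neither assumption is confirmed by this page, so treat the numbers as an estimate.

```shell
# Values from the example command; grad_accum is the documented default
# for --gradient-accumulation-steps.
seq_len=2048
batch_size=4
grad_accum=1
steps=10000

# Assumption: one optimizer step sees batch_size * grad_accum sequences,
# each packed to exactly seq_len tokens.
tokens_per_step=$((seq_len * batch_size * grad_accum))
total_tokens=$((tokens_per_step * steps))

echo "tokens per optimizer step: ${tokens_per_step}"
echo "tokens over the full run:  ${total_tokens}"
```

Comparing the total against the size of the tokenized corpus gives a quick idea of how many passes over the shards the run would make under these assumptions.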

Parameters

Parameter	Default	Description
`--arch`	required	Model architecture, such as `llama`, `qwen`, `gemma`, `mistral`, `phi`, or `gpt-oss`
`--shards`	required	Glob pattern or comma-separated list of `.bin` shard files
`--seq-len`	`2048`	Packed sequence length
`--batch-size`	`4`	Batch size
`--steps`	`10000`	Training steps
`--learning-rate`	`3e-4`	Peak learning rate
`--min-lr`	`1e-5`	Minimum learning rate (floor of the cosine schedule)
`--warmup-steps`	`1000`	Linear warmup steps
`--lr-schedule`	`cosine`	Learning-rate schedule: `constant`, `linear`, or `cosine`
`--weight-decay`	`0.1`	AdamW weight decay
`--max-grad-norm`	`1.0`	Maximum gradient norm for clipping; `0` disables clipping
`--eos-token-id`	`0`	EOS token inserted between documents
`--output`	`./pretrain-output`	Output and checkpoint directory
`--checkpoint-every`	`1000`	Save checkpoint every N steps; `0` disables
`--resume`	—	Resume from checkpoint directory
`--model-config`	—	Model config JSON that overrides the architecture defaults
`--z-loss`	`0.0`	MoE router z-loss coefficient
`--gradient-accumulation-steps`	`1`	Gradient accumulation steps; multiplies the effective batch size
`--log-every`	`10`	Log every N steps
`--eval-every`	`0`	Evaluate every N steps; `0` disables
`--eval-batches`	`10`	Batches per eval round
`--seed`	`42`	Random seed
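
To see how the default learning-rate flags interact, the sketch below prints the rate at a few steps. It assumes the common formulation: linear warmup from zero to `--learning-rate` over `--warmup-steps`, then cosine decay to `--min-lr` at the final step. The exact formula `pmetal` uses is not stated on this page, and `lr_at_step` is a hypothetical helper, not part of the CLI.

```shell
# Hypothetical helper (not part of pmetal): prints the learning rate at a
# given step under an ASSUMED linear-warmup + cosine-decay schedule.
# The constants are the documented defaults for --learning-rate, --min-lr,
# --warmup-steps, and --steps.
lr_at_step() {
  awk -v step="$1" -v peak=3e-4 -v min_lr=1e-5 -v warmup=1000 -v total=10000 'BEGIN {
    pi = atan2(0, -1)
    if (step < warmup) {
      # Linear warmup from 0 to the peak rate.
      lr = peak * step / warmup
    } else {
      # Cosine decay from the peak rate down to min_lr at the last step.
      progress = (step - warmup) / (total - warmup)
      lr = min_lr + 0.5 * (peak - min_lr) * (1 + cos(pi * progress))
    }
    printf "%.3e\n", lr
  }'
}

for s in 0 500 1000 5500 10000; do
  printf "step %5d  lr %s\n" "$s" "$(lr_at_step "$s")"
done
```

Under these assumptions the rate peaks at the end of warmup and reaches the `--min-lr` floor on the final step, so shortening `--steps` without adjusting `--warmup-steps` leaves less of the run in the decay phase.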

See Also

pmetal tokenize — Create pretraining shards
Training Overview — Training methods