pmetal tokenize

Convert a JSONL text corpus into tokenized binary shard files for use with `pmetal pretrain`.

Usage

pmetal tokenize \
  --input <CORPUS.jsonl> \
  --output <SHARD_DIR> \
  --tokenizer <MODEL_OR_TOKENIZER_PATH> \
  [OPTIONS]

pmetal tokenize \
  --input corpus.jsonl \
  --output ./shards \
  --tokenizer Qwen/Qwen3-0.6B \
  --text-column text \
  --docs-per-shard 10000
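
The example above expects `corpus.jsonl` to hold one JSON object per line, with the raw text stored under the field named by `--text-column`. A minimal Python sketch for producing such a file follows. The helper name `write_jsonl` and the sample documents are illustrative assumptions, not part of pmetal.

```python
import json
from pathlib import Path


def write_jsonl(docs, path, text_column="text"):
    """Write one JSON object per line, storing each document under `text_column`.

    Hypothetical helper for illustration; not part of pmetal.
    """
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for doc in docs:
            # json.dumps escapes embedded newlines, so each record stays on one line.
            f.write(json.dumps({text_column: doc}) + "\n")
    return path


# Sample documents (made up for this sketch).
docs = ["First document.", "Second document,\nwith a newline."]
out = write_jsonl(docs, "corpus.jsonl")
```

If the corpus stores its text under a different field, pass that name as `text_column` here and give the same name to `--text-column`.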

| Parameter | Default | Description |
|---|---|---|
| `--input` | required | Input JSONL file |
| `--output` | required | Output directory for shard files |
| `--tokenizer` | required | HuggingFace tokenizer model ID or local tokenizer path |
| `--text-column` | `text` | JSONL field containing raw text |
| `--docs-per-shard` | `10000` | Maximum documents per shard |
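
Because `--docs-per-shard` caps the number of documents in each shard, a rough shard count can be estimated before running the command. The Python sketch below assumes that every non-empty line of the JSONL file is one document and that each shard is filled to the cap before the next one starts. Neither assumption is confirmed here, and both function names are hypothetical.

```python
import math


def count_documents(path):
    """Count non-empty lines, assuming one JSON document per line."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def estimate_shard_count(num_docs, docs_per_shard=10000):
    """Estimate the shard count, assuming shards are filled up to the cap."""
    if docs_per_shard <= 0:
        raise ValueError("docs_per_shard must be positive")
    return math.ceil(num_docs / docs_per_shard)


# 25,000 documents at the default cap of 10,000 per shard.
print(estimate_shard_count(25_000))  # prints 3
```

For a real corpus, `estimate_shard_count(count_documents("corpus.jsonl"))` gives the same estimate from the file itself.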