Skip to content

pmetal tokenize

Convert JSONL text into tokenized binary shard files for pmetal pretrain.

Terminal window
pmetal tokenize \
--input <CORPUS.jsonl> \
--output <SHARD_DIR> \
--tokenizer <MODEL_OR_TOKENIZER_PATH> \
[OPTIONS]
Terminal window
pmetal tokenize \
--input corpus.jsonl \
--output ./shards \
--tokenizer Qwen/Qwen3-0.6B \
--text-column text \
--docs-per-shard 10000
ParameterDefaultDescription
--inputrequiredInput JSONL file
--outputrequiredOutput directory for shard files
--tokenizerrequiredHuggingFace tokenizer model ID or local tokenizer path
--text-columntextJSONL field containing raw text
--docs-per-shard10000Maximum documents per shard