pmetal tokenize
Convert JSONL text into tokenized binary shard files for pmetal pretrain.

    pmetal tokenize \
      --input <CORPUS.jsonl> \
      --output <SHARD_DIR> \
      --tokenizer <MODEL_OR_TOKENIZER_PATH> \
      [OPTIONS]

Example

    pmetal tokenize \
      --input corpus.jsonl \
      --output ./shards \
      --tokenizer Qwen/Qwen3-0.6B \
      --text-column text \
      --docs-per-shard 10000

Parameters

| Parameter | Default | Description |
|---|---|---|
| `--input` | required | Input JSONL file |
| `--output` | required | Output directory for shard files |
| `--tokenizer` | required | HuggingFace tokenizer model ID or local tokenizer path |
| `--text-column` | `text` | JSONL field containing raw text |
| `--docs-per-shard` | `10000` | Maximum documents per shard |
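
Before tokenizing a large corpus, it can help to confirm that every record carries the field named by `--text-column` and to get a rough idea of how many shards to expect. The sketch below is not part of pmetal: `check_corpus` and `estimated_shards` are hypothetical helper names. It assumes one JSON object per line, and it assumes each shard is filled up to `--docs-per-shard` before a new one starts, which the table above implies but does not state outright.

```python
import json
import math
import tempfile
from pathlib import Path


def check_corpus(path, text_column="text"):
    """Count usable records and collect line numbers of problem records.

    Hypothetical helper, not a pmetal API. A record is "usable" when it is a
    JSON object whose `text_column` value is a string.
    """
    good, bad = 0, []
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # this checker skips blank lines (an assumption)
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad.append(lineno)
                continue
            if isinstance(record, dict) and isinstance(record.get(text_column), str):
                good += 1
            else:
                bad.append(lineno)
    return good, bad


def estimated_shards(num_docs, docs_per_shard=10000):
    """Estimate shard count, assuming shards are filled to the maximum."""
    return math.ceil(num_docs / docs_per_shard)


# Demo on a throwaway three-line corpus; the third record uses the wrong field.
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "corpus.jsonl"
    corpus.write_text(
        '{"text": "first document"}\n'
        '{"text": "second document"}\n'
        '{"body": "wrong field name"}\n',
        encoding="utf-8",
    )
    good, bad = check_corpus(corpus)
    print(good, bad)                # 2 [3]
    print(estimated_shards(25000))  # 3
```

If `bad` is non-empty, either fix those records or point `--text-column` at the field the corpus actually uses. How `pmetal tokenize` itself treats malformed or blank lines is not documented here, so treat this check as a precaution rather than a description of its behavior.
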
See Also
- pmetal pretrain: Train from binary shards
- Dataset Utilities: Dataset conversion and validation