pmetal dflash
Runs a Qwen3 target model alongside a DFlash draft model so that decoding can emit multiple tokens per forward pass. Greedy mode (--temperature 0) is intended to reproduce the baseline output while speeding up decoding.

    pmetal dflash \
      --target <TARGET_MODEL> \
      --draft <DRAFT_MODEL> \
      --prompt <PROMPT> \
      [OPTIONS]

Example

    pmetal dflash \
      --target Qwen/Qwen3-4B \
      --draft z-lab/Qwen3-4B-DFlash-b16 \
      --prompt "Write a concise explanation of speculative decoding" \
      --max-new-tokens 128 \
      --temperature 0

Parameters

| Parameter | Default | Description |
|---|---|---|
| --target | required | Target Qwen3 model ID or local path |
| --draft | required | DFlash draft model ID or local path |
| --prompt | required | Input prompt |
| --max-new-tokens | 128 | Maximum number of generated tokens |
| --temperature | 0.0 | Sampling temperature; 0 is greedy |
| --speculative-tokens | draft default | Override the draft block size |
| --draft-fp8 | false | Quantize draft linear weights to FP8 on load |
| --json | false | Emit a JSON report instead of plain text |
| --no-chat | false | Tokenize the prompt verbatim, without the chat template |
| --tree-budget | 0 | Tree-verify budget |
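
The options above can be combined from a small wrapper script. The following is a minimal sketch, not part of pmetal: the DFLASH_* environment variable names are hypothetical, every flag comes from the table, and boolean options such as --json are assumed to be bare switches. By default the script only prints the command it would run; setting DFLASH_RUN=1 is assumed to find pmetal on PATH.

```shell
#!/bin/sh
# Sketch only: the DFLASH_* variable names are hypothetical.
# Every flag below is taken from the parameter table; boolean options are
# assumed to be bare switches (an assumption, not confirmed by the table).
set -eu

# Required values, defaulting to the ones used in the Example section.
DFLASH_TARGET="${DFLASH_TARGET:-Qwen/Qwen3-4B}"
DFLASH_DRAFT="${DFLASH_DRAFT:-z-lab/Qwen3-4B-DFlash-b16}"
DFLASH_PROMPT="${DFLASH_PROMPT:-Write a concise explanation of speculative decoding}"

# Build the argument list in the positional parameters so that values
# containing spaces (such as the prompt) stay single arguments.
set -- dflash \
  --target "$DFLASH_TARGET" \
  --draft "$DFLASH_DRAFT" \
  --prompt "$DFLASH_PROMPT" \
  --max-new-tokens "${DFLASH_MAX_NEW_TOKENS:-128}" \
  --temperature "${DFLASH_TEMPERATURE:-0}"

# Optional flags are appended only when the matching variable is set.
if [ -n "${DFLASH_SPECULATIVE_TOKENS:-}" ]; then
  set -- "$@" --speculative-tokens "$DFLASH_SPECULATIVE_TOKENS"
fi
if [ -n "${DFLASH_TREE_BUDGET:-}" ]; then
  set -- "$@" --tree-budget "$DFLASH_TREE_BUDGET"
fi
if [ "${DFLASH_DRAFT_FP8:-0}" = 1 ]; then set -- "$@" --draft-fp8; fi
if [ "${DFLASH_JSON:-0}" = 1 ]; then set -- "$@" --json; fi
if [ "${DFLASH_NO_CHAT:-0}" = 1 ]; then set -- "$@" --no-chat; fi

if [ "${DFLASH_RUN:-0}" = 1 ]; then
  # Real run: hand the assembled arguments to pmetal.
  pmetal "$@"
else
  # Dry run: print the command, one argument per line.
  printf '%s\n' pmetal "$@"
fi
```

Saved under a name of your choosing (dflash.sh here is only a placeholder), a JSON run could look like `DFLASH_JSON=1 DFLASH_RUN=1 sh dflash.sh > report.json`. The fields of the JSON report are not described on this page, so the sketch makes no attempt to parse them.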
See Also

- pmetal infer: General inference
- pmetal serve: HTTP serving