pmetal dflash

Runs speculative decoding with a Qwen3 target model and a DFlash draft model, so that the target can emit multiple tokens per forward pass. Greedy mode (--temperature 0) is intended to produce the same output as baseline decoding of the target model while accelerating decode.

Usage:
pmetal dflash \
--target <TARGET_MODEL> \
--draft <DRAFT_MODEL> \
--prompt <PROMPT> \
[OPTIONS]

Example:
pmetal dflash \
--target Qwen/Qwen3-4B \
--draft z-lab/Qwen3-4B-DFlash-b16 \
--prompt "Write a concise explanation of speculative decoding" \
--max-new-tokens 128 \
--temperature 0
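
Since greedy mode is intended to match baseline output, one way to spot-check it is to run the same prompt with two different draft block sizes and compare the results. The following is a minimal sketch, not an official test procedure. The block sizes 4 and 8, the output file names, and the assumption that plain-text output contains only the generated text (no timing statistics) are all illustrative and not confirmed by this page.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Arguments shared by both runs, taken from the example above.
common=(dflash
  --target Qwen/Qwen3-4B
  --draft z-lab/Qwen3-4B-DFlash-b16
  --prompt "Write a concise explanation of speculative decoding"
  --max-new-tokens 128
  --temperature 0)

if command -v pmetal >/dev/null 2>&1; then
  # Block sizes 4 and 8 are assumed values chosen for illustration.
  pmetal "${common[@]}" --speculative-tokens 4 > run_block4.txt
  pmetal "${common[@]}" --speculative-tokens 8 > run_block8.txt

  # If the output includes per-run statistics, diff will flag those too.
  if diff -q run_block4.txt run_block8.txt >/dev/null; then
    echo "greedy outputs are identical"
  else
    echo "greedy outputs differ"
  fi
else
  echo "pmetal not found on PATH; skipping comparison"
fi
```

If the two files differ only in timing or throughput lines, the generated text itself may still match; in that case compare just the text portion.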

Parameter             Default        Description
--target              required       Target Qwen3 model ID or local path
--draft               required       DFlash draft model ID or local path
--prompt              required       Input prompt
--max-new-tokens      128            Maximum generated tokens
--temperature         0.0            Sampling temperature; 0 is greedy
--speculative-tokens  draft default  Override draft block size
--draft-fp8           false          Quantize draft linear weights to FP8 on load
--json                false          Emit JSON report instead of plain text
--no-chat             false          Tokenize prompt verbatim without the chat template
--tree-budget         0              Tree-verify budget
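
The optional flags can be combined with the required ones. The sketch below builds the argument list in a Bash array, quantizes the draft to FP8, and saves the JSON report to a file. The file name report.json, the prompt text, and the token limit of 64 are illustrative choices; the structure of the JSON report is not described here, so the script only stores it without parsing any fields.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Required flags plus a few options from the table above.
args=(dflash
  --target Qwen/Qwen3-4B
  --draft z-lab/Qwen3-4B-DFlash-b16
  --prompt "Summarize speculative decoding in two sentences"
  --max-new-tokens 64
  --temperature 0
  --draft-fp8
  --json)

if command -v pmetal >/dev/null 2>&1; then
  # report.json is a hypothetical output path, not one defined by pmetal.
  pmetal "${args[@]}" > report.json
  echo "report written to report.json"
else
  # Show the command that would have been run.
  printf 'pmetal not found; would run: pmetal %s\n' "${args[*]}"
fi
```

Keeping the arguments in an array avoids quoting problems when the prompt contains spaces.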