# pmetal infer
Run inference on a loaded model. Supports interactive chat, tool/function calling, thinking mode controls, sampling presets, multiple execution backends, FP8 weights, LoRA adapters, packed experts, KV cache quantization, TurboQuant KV compression, DFlash/speculative paths, and opt-in per-layer profiling for supported hybrid models.
```bash
pmetal infer \
  --model <MODEL> \
  --prompt <PROMPT> \
  [OPTIONS]
```

## Examples

```bash
# Simple generation
pmetal infer --model Qwen/Qwen3-0.6B --prompt "What is 2+2?"

# Interactive chat with LoRA
pmetal infer \
  --model Qwen/Qwen3-0.6B \
  --lora ./output/lora_weights.safetensors \
  --chat

# FP8 quantized inference (2× memory reduction)
pmetal infer --model Qwen/Qwen3-4B --fp8 --chat

# With tool definitions
pmetal infer \
  --model Qwen/Qwen3-0.6B \
  --tools tools.json --chat

# ANE-optimized inference
pmetal infer --model Qwen/Qwen3-0.6B --ane-max-seq-len 2048

# JIT-compiled sampling
pmetal infer --model Qwen/Qwen3-0.6B --compiled --chat

# TurboQuant KV cache compression
pmetal infer --model Qwen/Qwen3-0.6B --kv-turboquant-preset q3_5 --chat

# DFlash/speculative backend
pmetal infer \
  --model Qwen/Qwen3-4B \
  --draft-model z-lab/Qwen3-4B-DFlash-b16 \
  --backend dflash \
  --prompt "Summarize speculative decoding"

# Profile Qwen 3.5 hybrid prefill + cached decode layers and write JSON
pmetal infer \
  --model unsloth/Qwen3.5-0.8B \
  --prompt "write a fizzbuzz program in python" \
  --chat --no-thinking --temperature 0 \
  --profile-layers \
  --profile-output .strategy/qwen35_layer_profile.json
```

## Parameters
| Parameter | Default | Description |
|---|---|---|
| `--model` | required | HuggingFace model ID or local path |
| `--prompt` | required | Input prompt |
| `--lora` | — | Path to LoRA adapter weights |
| `--temperature` | model default | Sampling temperature |
| `--top-k` | model default | Top-k sampling |
| `--top-p` | model default | Nucleus sampling |
| `--min-p` | model default | Min-p dynamic sampling |
| `--max-tokens` | 256 | Maximum generation length |
| `--repetition-penalty` | 1.0 | Repetition penalty |
| `--frequency-penalty` | 0.0 | Frequency penalty |
| `--presence-penalty` | 0.0 | Presence penalty |
| `--chat` | false | Apply chat template |
| `--no-thinking` | false | Disable thinking mode for models that support it |
| `--hide-thinking` | false | Hide thinking trace from output |
| `--mode` | auto | Model-family sampling preset |
| `--backend` | auto | `auto`, `standard`, `compiled`, `metal-sampler`, `ane`, `minimal`, or `dflash` |
| `--draft-model` | — | Draft model for speculative decoding |
| `--fp8` | false | FP8 weights (~2× mem reduction) |
| `--experts-dir` | — | Packed expert weights directory for SSD-offloaded MoE inference |
| `--compiled` | false | JIT-compiled sampling |
| `--profile-layers` | false | Run an opt-in per-layer forward profile for supported hybrid models |
| `--profile-output` | — | Write the layer profile report as pretty JSON |
| `--ane` | false | Enable ANE inference when compiled with `ane` |
| `--ane-max-seq-len` | 1024 | Max ANE kernel sequence length |
| `--ane-real-time` | false | Use experimental ANE real-time evaluation path |
| `--tools` | — | Tool definitions file (OpenAI format) |
| `--system` | — | System message |
| `--kv-quant` | — | KV cache bits: 8, 4, or 0 |
| `--kv-k-bits` | — | Key-only KV cache bits for asymmetric K/V models |
| `--kv-v-bits` | — | Value-only KV cache bits for asymmetric K/V models |
| `--kv-group-size` | 64 | KV quantization group size |
| `--kv-turboquant` | false | Enable TurboQuant KV compression |
| `--kv-turboquant-preset` | — | `q2_5` or `q3_5` mixed-bit TurboQuant preset |
| `--no-kv-quant` | false | Force fp16 KV cache |
| `--detect-repetition` | false | Enable n-gram repetition loop detection |
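
As a minimal sketch of how several of these flags combine in a script, the wrapper below assembles a `pmetal infer` command line and appends optional flags only when a value is set. The `infer_cmd` function and the `PMETAL_BIN`, `LORA`, `KV_BITS`, and `MAX_TOKENS` variables are hypothetical names chosen for illustration, not PMetal settings; whether a particular flag combination is accepted depends on the model and build.

```bash
# Hypothetical wrapper; PMETAL_BIN, LORA, KV_BITS, MAX_TOKENS and infer_cmd
# are illustrative names, not PMetal settings.
PMETAL_BIN="${PMETAL_BIN:-pmetal}"

# infer_cmd MODEL PROMPT
infer_cmd() {
  # Start from the two required flags, kept as separate positional arguments.
  set -- infer --model "$1" --prompt "$2"

  # Append optional flags only when the caller set a value.
  if [ -n "${LORA:-}" ]; then set -- "$@" --lora "$LORA"; fi
  if [ -n "${KV_BITS:-}" ]; then set -- "$@" --kv-quant "$KV_BITS"; fi
  if [ -n "${MAX_TOKENS:-}" ]; then set -- "$@" --max-tokens "$MAX_TOKENS"; fi

  # "$@" preserves a prompt containing spaces as a single argument.
  "$PMETAL_BIN" "$@"
}
```

For example, `KV_BITS=8 MAX_TOKENS=512 infer_cmd Qwen/Qwen3-0.6B "What is 2+2?"` would run with an 8-bit KV cache and a 512-token limit. Setting `PMETAL_BIN=echo` prints the assembled command instead of running it, which is a cheap way to check the flags first.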
## Layer Profiling

`--profile-layers` is currently implemented for standard Qwen 3.5 / `qwen3_next` inference. It runs one real prefill pass and one real cached decode pass using the shared inference runner, forcing MLX evaluation at each measured section so the report reflects actual wall time instead of only op scheduling overhead.

Use `--profile-output <PATH>` to capture the full JSON report. The CLI summary prints:

- total layer time vs non-layer overhead
- aggregated time by layer kind (`linear_attention` vs `full_attention`)
- top section buckets within each kind
- the slowest individual layers and their main sections

This makes long-prompt hybrid profiles easier to read when deciding whether the next prompt-heavy optimization should target GDN prefill, full-attention preparation/SDPA, sparse MoE combine, or decode-only paths.
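
One way to use the report is to profile a short and a long prompt under identical settings and compare the two JSON files. The sketch below reuses only the flags from the profiling example in the Examples section; the `profile_prompt` function, the `PMETAL_BIN` variable, and the output file naming are assumptions for illustration.

```bash
# Hypothetical helper, not part of PMetal.
PMETAL_BIN="${PMETAL_BIN:-pmetal}"

# profile_prompt LABEL PROMPT [OUT_DIR]
profile_prompt() {
  out_dir="${3:-.}"
  # Create the report directory up front rather than relying on the CLI to do it.
  mkdir -p "$out_dir"

  # Same flags for every run, so only the prompt differs between reports.
  "$PMETAL_BIN" infer \
    --model unsloth/Qwen3.5-0.8B \
    --prompt "$2" \
    --chat --no-thinking --temperature 0 \
    --profile-layers \
    --profile-output "$out_dir/layer_profile_$1.json"
}
```

Calling `profile_prompt short "write a fizzbuzz program in python" ./profiles` and then `profile_prompt long "$(cat long_prompt.txt)" ./profiles` (where `long_prompt.txt` is a placeholder for your own prompt file) leaves two reports side by side for comparison.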
## Chat Mode

With `--chat`, PMetal applies the model’s chat template and starts an interactive session:

```text
> What is quantum entanglement?
Quantum entanglement is a phenomenon where two particles...

> Can you explain it more simply?
Think of it like two coins that always land on opposite sides...
```

## Tool Use
Pass OpenAI-format tool definitions with `--tools`:

```json
[
  {
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather",
      "parameters": {
        "type": "object",
        "properties": {
          "location": { "type": "string" }
        }
      }
    }
  }
]
```

Supported for Qwen, Llama 3.1+, Mistral v3+, and DeepSeek models.
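
For scripted runs, the definitions file can be generated next to the command that uses it. The sketch below writes the `get_weather` definition to a file with a here-document. The `write_tools_file` function and the `"required"` entry are additions for illustration: `required` is standard JSON Schema as used in OpenAI-style tool definitions, but whether PMetal enforces it is an assumption.

```bash
# write_tools_file PATH: hypothetical helper, not part of PMetal.
write_tools_file() {
  # The quoted 'EOF' delimiter stops the shell from expanding anything in the JSON.
  cat > "$1" <<'EOF'
[
  {
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather",
      "parameters": {
        "type": "object",
        "properties": {
          "location": { "type": "string" }
        },
        "required": ["location"]
      }
    }
  }
]
EOF
}
```

A session could then be started with `write_tools_file tools.json && pmetal infer --model Qwen/Qwen3-0.6B --tools tools.json --chat`.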
## See Also

- pmetal serve — OpenAI-compatible inference server
- Rust Facade — Programmatic inference
- Python SDK — Python inference