pmetal infer

Run inference on a loaded model. Supports interactive chat, tool/function calling, thinking mode controls, sampling presets, multiple execution backends, FP8 weights, LoRA adapters, packed experts, KV cache quantization, TurboQuant KV compression, DFlash/speculative paths, and opt-in per-layer profiling for supported hybrid models.

Usage

pmetal infer \
  --model <MODEL> \
  --prompt <PROMPT> \
  [OPTIONS]

Examples

# Simple generation
pmetal infer --model Qwen/Qwen3-0.6B --prompt "What is 2+2?"

# Interactive chat with LoRA
pmetal infer \
  --model Qwen/Qwen3-0.6B \
  --lora ./output/lora_weights.safetensors \
  --chat

# FP8 quantized inference (~2× memory reduction)
pmetal infer --model Qwen/Qwen3-4B --fp8 --chat

# With tool definitions
pmetal infer \
  --model Qwen/Qwen3-0.6B \
  --tools tools.json --chat

# ANE-optimized inference
pmetal infer --model Qwen/Qwen3-0.6B --ane-max-seq-len 2048

# JIT-compiled sampling
pmetal infer --model Qwen/Qwen3-0.6B --compiled --chat

# TurboQuant KV cache compression
pmetal infer --model Qwen/Qwen3-0.6B --kv-turboquant-preset q3_5 --chat

# DFlash/speculative backend
pmetal infer \
  --model Qwen/Qwen3-4B \
  --draft-model z-lab/Qwen3-4B-DFlash-b16 \
  --backend dflash \
  --prompt "Summarize speculative decoding"

# Profile Qwen 3.5 hybrid prefill + cached decode layers and write JSON
pmetal infer \
  --model unsloth/Qwen3.5-0.8B \
  --prompt "write a fizzbuzz program in python" \
  --chat --no-thinking --temperature 0 \
  --profile-layers \
  --profile-output .strategy/qwen35_layer_profile.json

Parameters

Parameter	Default	Description
`--model`	required	HuggingFace model ID or local path
`--prompt`	required	Input prompt
`--lora`	—	Path to LoRA adapter weights
`--temperature`	model default	Sampling temperature
`--top-k`	model default	Top-k sampling
`--top-p`	model default	Nucleus sampling
`--min-p`	model default	Min-p dynamic sampling
`--max-tokens`	`256`	Maximum generation length
`--repetition-penalty`	`1.0`	Repetition penalty
`--frequency-penalty`	`0.0`	Frequency penalty
`--presence-penalty`	`0.0`	Presence penalty
`--chat`	`false`	Apply the model’s chat template and start an interactive session
`--no-thinking`	`false`	Disable thinking mode for models that support it
`--hide-thinking`	`false`	Hide thinking trace from output
`--mode`	`auto`	Model-family sampling preset
`--backend`	`auto`	`auto`, `standard`, `compiled`, `metal-sampler`, `ane`, `minimal`, or `dflash`
`--draft-model`	—	Draft model for speculative decoding
`--fp8`	`false`	FP8 weights (~2× memory reduction)
`--experts-dir`	—	Packed expert weights directory for SSD-offloaded MoE inference
`--compiled`	`false`	JIT-compiled sampling
`--profile-layers`	`false`	Run an opt-in per-layer forward profile for supported hybrid models
`--profile-output`	—	Write the layer profile report as pretty JSON
`--ane`	`false`	Enable ANE inference when compiled with `ane`
`--ane-max-seq-len`	`1024`	Max ANE kernel sequence length
`--ane-real-time`	`false`	Use experimental ANE real-time evaluation path
`--tools`	—	Tool definitions file (OpenAI format)
`--system`	—	System message
`--kv-quant`	—	KV cache bits: `8`, `4`, or `0`
`--kv-k-bits`	—	Key-only KV cache bits for asymmetric K/V models
`--kv-v-bits`	—	Value-only KV cache bits for asymmetric K/V models
`--kv-group-size`	`64`	KV quantization group size
`--kv-turboquant`	`false`	Enable TurboQuant KV compression
`--kv-turboquant-preset`	—	`q2_5` or `q3_5` mixed-bit TurboQuant preset
`--no-kv-quant`	`false`	Force fp16 KV cache
`--detect-repetition`	`false`	Enable n-gram repetition loop detection
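
The KV cache flags in the table can be compared side by side. The sketch below is a hedged example rather than a benchmark recipe: it reruns one prompt with `--kv-quant 8` and `--kv-quant 4`, using only flags listed in the table, and sets `--temperature 0` to reduce sampling variation between runs. The `run` helper is hypothetical and not part of PMetal; it echoes each command and executes it only when `pmetal` is on the PATH. Whether a given model accepts every bit width is an assumption to verify locally.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print the command, then run it only if pmetal is installed.
run() {
  echo "+ $*"
  if command -v pmetal > /dev/null 2>&1; then
    "$@"
  fi
}

# Same prompt at each KV cache bit width tried here (values from the table).
for bits in 8 4; do
  run pmetal infer \
    --model Qwen/Qwen3-0.6B \
    --prompt "What is 2+2?" \
    --temperature 0 \
    --kv-quant "$bits" \
    --kv-group-size 64
done
```

Comparing the two outputs by eye gives a quick impression of how much a lower bit width changes the answer for a given model.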

Layer Profiling

`--profile-layers` is currently implemented for standard Qwen 3.5 / `qwen3_next` inference. It runs one real prefill pass and one real cached decode pass using the shared inference runner, forcing MLX evaluation at each measured section so the report reflects actual wall time rather than only op scheduling overhead.

Use `--profile-output <PATH>` to capture the full JSON report. The CLI summary prints:

total layer time vs non-layer overhead
aggregated time by layer kind (`linear_attention` vs `full_attention`)
top section buckets within each kind
the slowest individual layers and their main sections

This breakdown makes long-prompt hybrid profiles easier to read when deciding whether the next prompt-heavy optimization should target GDN prefill, full-attention preparation/SDPA, sparse MoE combine, or decode-only paths.
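
A possible end-to-end capture is sketched below under two assumptions that this page does not confirm: that the parent directory of the `--profile-output` path may need to exist beforehand, and that a quick JSON parse is a useful sanity check before reading the report. The model, prompt, flags, and path are taken from the profiling example earlier on this page. The report schema is not documented here, so the sketch only validates the file and does not read specific fields.

```shell
#!/usr/bin/env bash
set -euo pipefail

report=".strategy/qwen35_layer_profile.json"

# Create the output directory up front (assumption: pmetal may not create it).
mkdir -p "$(dirname "$report")"

# Run the profile only when pmetal is installed.
if command -v pmetal > /dev/null 2>&1; then
  pmetal infer \
    --model unsloth/Qwen3.5-0.8B \
    --prompt "write a fizzbuzz program in python" \
    --chat --no-thinking --temperature 0 \
    --profile-layers \
    --profile-output "$report"
fi

# If a report was written, confirm it parses as JSON and show its size in bytes.
if [ -f "$report" ]; then
  python3 -m json.tool "$report" > /dev/null
  wc -c "$report"
fi
```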

Chat Mode

With `--chat`, PMetal applies the model’s chat template and starts an interactive session:

> What is quantum entanglement?
Quantum entanglement is a phenomenon where two particles...

> Can you explain it more simply?
Think of it like two coins that always land on opposite sides...

Tool Use

Pass OpenAI-format tool definitions with `--tools`:

[
  {
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather",
      "parameters": {
        "type": "object",
        "properties": {
          "location": { "type": "string" }
        }
      }
    }
  }
]

Tool use is supported for Qwen, Llama 3.1+, Mistral v3+, and DeepSeek models.
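
One way to put the definition above to use is sketched here: write it to `tools.json`, check that it parses before loading a model, then start a chat session. The file name and model come from the examples earlier on this page. The validation step with `python3 -m json.tool` is an addition for illustration, not something PMetal requires, and the `command -v` guard simply skips the launch when `pmetal` is not installed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write the single-tool definition shown above to tools.json.
cat > tools.json <<'EOF'
[
  {
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather",
      "parameters": {
        "type": "object",
        "properties": {
          "location": { "type": "string" }
        }
      }
    }
  }
]
EOF

# Fail early on malformed JSON instead of after the model has loaded.
python3 -m json.tool tools.json > /dev/null

# Start an interactive session with the tool available (skipped if pmetal is absent).
if command -v pmetal > /dev/null 2>&1; then
  pmetal infer --model Qwen/Qwen3-0.6B --tools tools.json --chat
fi
```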