
pmetal infer

Run inference on a loaded model. Supports interactive chat, tool/function calling, thinking mode controls, sampling presets, multiple execution backends, FP8 weights, LoRA adapters, packed experts, KV cache quantization, TurboQuant KV compression, DFlash/speculative paths, and opt-in per-layer profiling for supported hybrid models.

pmetal infer \
  --model <MODEL> \
  --prompt <PROMPT> \
  [OPTIONS]

# Simple generation
pmetal infer --model Qwen/Qwen3-0.6B --prompt "What is 2+2?"

# Interactive chat with LoRA
pmetal infer \
  --model Qwen/Qwen3-0.6B \
  --lora ./output/lora_weights.safetensors \
  --chat

# FP8 quantized inference (2× memory reduction)
pmetal infer --model Qwen/Qwen3-4B --fp8 --chat

# With tool definitions
pmetal infer \
  --model Qwen/Qwen3-0.6B \
  --tools tools.json --chat

# ANE-optimized inference
pmetal infer --model Qwen/Qwen3-0.6B --ane-max-seq-len 2048

# JIT-compiled sampling
pmetal infer --model Qwen/Qwen3-0.6B --compiled --chat

# TurboQuant KV cache compression
pmetal infer --model Qwen/Qwen3-0.6B --kv-turboquant-preset q3_5 --chat

# DFlash/speculative backend
pmetal infer \
  --model Qwen/Qwen3-4B \
  --draft-model z-lab/Qwen3-4B-DFlash-b16 \
  --backend dflash \
  --prompt "Summarize speculative decoding"

# Profile Qwen 3.5 hybrid prefill + cached decode layers and write JSON
pmetal infer \
  --model unsloth/Qwen3.5-0.8B \
  --prompt "write a fizzbuzz program in python" \
  --chat --no-thinking --temperature 0 \
  --profile-layers \
  --profile-output .strategy/qwen35_layer_profile.json

| Parameter | Default | Description |
|---|---|---|
| --model | required | HuggingFace model ID or local path |
| --prompt | required | Input prompt |
| --lora | | Path to LoRA adapter weights |
| --temperature | model default | Sampling temperature |
| --top-k | model default | Top-k sampling |
| --top-p | model default | Nucleus sampling |
| --min-p | model default | Min-p dynamic sampling |
| --max-tokens | 256 | Maximum generation length |
| --repetition-penalty | 1.0 | Repetition penalty |
| --frequency-penalty | 0.0 | Frequency penalty |
| --presence-penalty | 0.0 | Presence penalty |
| --chat | false | Apply chat template |
| --no-thinking | false | Disable thinking mode for models that support it |
| --hide-thinking | false | Hide thinking trace from output |
| --mode | auto | Model-family sampling preset |
| --backend | auto | auto, standard, compiled, metal-sampler, ane, minimal, or dflash |
| --draft-model | | Draft model for speculative decoding |
| --fp8 | false | FP8 weights (~2× mem reduction) |
| --experts-dir | | Packed expert weights directory for SSD-offloaded MoE inference |
| --compiled | false | JIT-compiled sampling |
| --profile-layers | false | Run an opt-in per-layer forward profile for supported hybrid models |
| --profile-output | | Write the layer profile report as pretty JSON |
| --ane | false | Enable ANE inference when compiled with ane |
| --ane-max-seq-len | 1024 | Max ANE kernel sequence length |
| --ane-real-time | false | Use experimental ANE real-time evaluation path |
| --tools | | Tool definitions file (OpenAI format) |
| --system | | System message |
| --kv-quant | | KV cache bits: 8, 4, or 0 |
| --kv-k-bits | | Key-only KV cache bits for asymmetric K/V models |
| --kv-v-bits | | Value-only KV cache bits for asymmetric K/V models |
| --kv-group-size | 64 | KV quantization group size |
| --kv-turboquant | false | Enable TurboQuant KV compression |
| --kv-turboquant-preset | | q2_5 or q3_5 mixed-bit TurboQuant preset |
| --no-kv-quant | false | Force fp16 KV cache |
| --detect-repetition | false | Enable n-gram repetition loop detection |
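
When `pmetal infer` is driven from a script, it can help to assemble the argument list programmatically instead of concatenating strings. The following is a minimal sketch, not part of PMetal: `build_infer_command` is a hypothetical helper, and it assumes every flag follows the `--kebab-case` spelling shown in the table, with boolean options passed as bare switches.

```python
import shlex


def build_infer_command(model, prompt=None, **options):
    """Build an argv list for `pmetal infer` (hypothetical helper).

    Keyword names map to flags by replacing underscores with hyphens
    (max_tokens -> --max-tokens). True adds a bare switch, False and None
    are skipped, and any other value is passed as the flag's argument.
    """
    argv = ["pmetal", "infer", "--model", model]
    if prompt is not None:
        argv += ["--prompt", prompt]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            argv.append(flag)
        elif value is False or value is None:
            continue
        else:
            argv += [flag, str(value)]
    return argv


cmd = build_infer_command(
    "Qwen/Qwen3-0.6B",
    prompt="What is 2+2?",
    max_tokens=128,
    temperature=0,
    kv_quant=8,
    no_thinking=True,
)

# Shell-quoted form, suitable for logging or copy-pasting into a terminal.
print(shlex.join(cmd))
```

The list form can be handed to `subprocess.run(cmd, check=True)` directly, which avoids shell quoting problems with prompts that contain spaces or quotes.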

--profile-layers is currently implemented for standard Qwen 3.5 / qwen3_next inference. It runs one real prefill pass and one real cached decode pass through the shared inference runner. Because MLX evaluates lazily, the profiler forces evaluation at each measured section, so the report reflects actual wall-clock time rather than only op-scheduling overhead.

Use --profile-output <PATH> to capture the full JSON report. The CLI summary prints:

  • total layer time vs non-layer overhead
  • aggregated time by layer kind (linear_attention vs full_attention)
  • top section buckets within each kind
  • the slowest individual layers and their main sections

This breakdown makes long-prompt hybrid profiles easier to read when deciding whether the next prompt-heavy optimization should target GDN prefill, full-attention preparation/SDPA, sparse MoE combine, or decode-only paths.
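
The report's JSON schema is not documented on this page, so a reasonable first step with a saved profile is to outline its structure before writing any analysis code. The sketch below makes no assumption about field names: `describe` and `outline_report` are hypothetical helpers that walk whatever JSON they are given, and the `sample` dictionary is stand-in data, not the real report layout.

```python
import json


def describe(node, depth=0, max_depth=2):
    """Return indented lines outlining the keys and value types of parsed JSON."""
    pad = "  " * depth
    lines = []
    if isinstance(node, dict):
        for key, value in node.items():
            lines.append(f"{pad}{key}: {type(value).__name__}")
            if depth < max_depth:
                lines.extend(describe(value, depth + 1, max_depth))
    elif isinstance(node, list) and node:
        # Only the first element is inspected; lists are assumed homogeneous.
        lines.append(f"{pad}[{len(node)} items, first is {type(node[0]).__name__}]")
        if depth < max_depth:
            lines.extend(describe(node[0], depth + 1, max_depth))
    return lines


def outline_report(path, max_depth=2):
    """Load a JSON file and outline its structure."""
    with open(path, encoding="utf-8") as handle:
        return describe(json.load(handle), max_depth=max_depth)


# Stand-in data for demonstration only; NOT the real report schema.
sample = {"example_list": [{"name": "x", "value": 1.5}], "example_number": 2.0}
outline = describe(sample)
print("\n".join(outline))
```

For a real run, call `outline_report(".strategy/qwen35_layer_profile.json")` with the path passed to --profile-output, then build any aggregation on the field names it reveals.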

With --chat, PMetal applies the model’s chat template and starts an interactive session:

> What is quantum entanglement?
Quantum entanglement is a phenomenon where two particles...
> Can you explain it more simply?
Think of it like two coins that always land on opposite sides...

Pass OpenAI-format tool definitions with --tools:

[
  {
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather",
      "parameters": {
        "type": "object",
        "properties": {
          "location": { "type": "string" }
        }
      }
    }
  }
]

Tool calling is supported for Qwen, Llama 3.1+, Mistral v3+, and DeepSeek models.
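
A malformed tools file is easier to catch before it reaches the model. The following sketch checks the basic shape of OpenAI-format tool definitions as shown above. `check_tools` is a hypothetical helper, and its rules are a minimal assumption about the format, not a description of what PMetal itself validates.

```python
import json


def check_tools(tools):
    """Return a list of problems found in OpenAI-format tool definitions."""
    if not isinstance(tools, list):
        return ["top level must be a JSON array"]
    problems = []
    for index, tool in enumerate(tools):
        where = f"tools[{index}]"
        if not isinstance(tool, dict):
            problems.append(f"{where}: must be an object")
            continue
        if tool.get("type") != "function":
            problems.append(f"{where}: type must be 'function'")
        function = tool.get("function")
        if not isinstance(function, dict):
            problems.append(f"{where}: missing 'function' object")
            continue
        if not isinstance(function.get("name"), str) or not function["name"]:
            problems.append(f"{where}: function.name must be a non-empty string")
        parameters = function.get("parameters", {})
        if not isinstance(parameters, dict) or parameters.get("type", "object") != "object":
            problems.append(f"{where}: function.parameters must be an object schema")
    return problems


tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
        },
    },
}]

problems = check_tools(tools)
# Serialize only when the definitions pass the checks.
serialized = json.dumps(tools, indent=2) if not problems else None
print(problems or serialized)
```

Writing `serialized` to `tools.json` produces a file that can be passed with --tools.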