# pmetal infer

Run inference on a loaded model. Supports interactive chat, tool/function calling, thinking mode, FP8 quantization, and LoRA adapter loading.

```sh
pmetal infer \
  --model <MODEL> \
  [--prompt <PROMPT>] \
  [OPTIONS]
```
```sh
# Simple generation
pmetal infer --model Qwen/Qwen3-0.6B --prompt "What is 2+2?"

# Interactive chat with LoRA
pmetal infer \
  --model Qwen/Qwen3-0.6B \
  --lora ./output/lora_weights.safetensors \
  --chat --show-thinking

# FP8 quantized inference (~2x memory reduction)
pmetal infer --model Qwen/Qwen3-4B --fp8 --chat

# With tool definitions
pmetal infer \
  --model Qwen/Qwen3-0.6B \
  --tools tools.json --chat

# ANE-optimized inference
pmetal infer --model Qwen/Qwen3-0.6B --ane-max-seq-len 2048

# JIT-compiled sampling
pmetal infer --model Qwen/Qwen3-0.6B --compiled --chat
```
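The ~2x figure for `--fp8` follows directly from halving the bytes per weight relative to a 16-bit baseline. A rough back-of-the-envelope estimate (assuming bf16 weights, an approximate 4B parameter count, and ignoring KV cache and activation memory):

```python
def weight_memory_gib(n_params: float, bytes_per_weight: int) -> float:
    """Approximate weight memory in GiB for a given parameter count."""
    return n_params * bytes_per_weight / 1024**3

n = 4e9  # Qwen3-4B, treating the parameter count as exactly 4B
bf16 = weight_memory_gib(n, 2)  # 16-bit weights: 2 bytes each
fp8 = weight_memory_gib(n, 1)   # FP8 weights: 1 byte each

print(f"bf16: {bf16:.1f} GiB, fp8: {fp8:.1f} GiB, ratio: {bf16 / fp8:.1f}x")
```

Real savings are somewhat smaller because the KV cache and activations are unaffected by weight quantization.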
| Parameter | Default | Description |
| --- | --- | --- |
| `--model` | required | HuggingFace model ID or local path |
| `--prompt` | | Input prompt (omit to read from stdin) |
| `--lora` | | Path to LoRA adapter weights |
| `--temperature` | model default | Sampling temperature |
| `--top-k` | model default | Top-k sampling |
| `--top-p` | model default | Nucleus (top-p) sampling |
| `--min-p` | model default | Min-p dynamic sampling |
| `--max-tokens` | 256 | Maximum generation length |
| `--repetition-penalty` | 1.0 | Repetition penalty |
| `--frequency-penalty` | 0.0 | Frequency penalty |
| `--presence-penalty` | 0.0 | Presence penalty |
| `--chat` | false | Apply chat template |
| `--show-thinking` | false | Show reasoning content |
| `--fp8` | false | FP8 weights (~2x memory reduction) |
| `--compiled` | false | JIT-compiled sampling |
| `--no-ane` | false | Disable ANE inference |
| `--ane-max-seq-len` | 1024 | Maximum ANE kernel sequence length |
| `--tools` | | Tool definitions file (OpenAI format) |
| `--system` | | System message |
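Of the sampling options above, min-p is the least standard: it keeps only tokens whose probability is at least `min_p` times the most likely token's probability, so the cutoff adapts to how confident the model is. A minimal pure-Python sketch of the idea (illustrative, not PMetal's internal implementation):

```python
import math

def min_p_filter(logits: list[float], min_p: float) -> list[float]:
    """Mask logits whose softmax probability falls below min_p * top probability."""
    # Softmax with max-subtraction for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)
    # Masked tokens get -inf so they can never be sampled.
    return [l if p >= threshold else float("-inf") for l, p in zip(logits, probs)]

filtered = min_p_filter([2.0, 1.0, -3.0], min_p=0.1)
```

With `min_p=0.1`, the third token (probability roughly 0.005 versus a top probability of roughly 0.73) is masked while the first two survive.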

With `--chat`, PMetal applies the model's chat template and starts an interactive session:

```
> What is quantum entanglement?
Quantum entanglement is a phenomenon where two particles...
> Can you explain it more simply?
Think of it like two coins that always land on opposite sides...
```
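The template applied is model-specific; the Qwen models used in the examples follow a ChatML-style format. A rough sketch of that format (the template shipped with each model's tokenizer is authoritative; this is only an approximation):

```python
def chatml_prompt(messages: list[dict]) -> str:
    """Render a message list in ChatML style, as used by Qwen-family models."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    # Leave an open assistant turn for the model to complete.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is quantum entanglement?"},
])
```

A `--system` message would occupy the first `system` turn in exactly this way.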

Pass OpenAI-format tool definitions with `--tools`:

```json
[
  {
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather",
      "parameters": {
        "type": "object",
        "properties": {
          "location": { "type": "string" }
        }
      }
    }
  }
]
```
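A convenient way to generate or sanity-check a tools file before passing it to `--tools` is to build it in Python and round-trip it through the standard `json` module (the keys below mirror the OpenAI function-calling schema):

```python
import json

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
            },
        },
    }
]

# Basic structural checks before writing the file.
for tool in tools:
    assert tool["type"] == "function"
    assert "name" in tool["function"]

serialized = json.dumps(tools, indent=2)
# Write the result out as tools.json for use with --tools:
# with open("tools.json", "w") as f:
#     f.write(serialized)
```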

Tool calling is supported for Qwen, Llama 3.1+, Mistral v3+, and DeepSeek models.