
pmetal serve

Start an HTTP inference server with OpenAI-compatible chat/completions/models/embeddings endpoints and an Anthropic-compatible messages endpoint. Requires the serve feature flag.

pmetal serve --model <MODEL> [OPTIONS]

# Start server
pmetal serve --model Qwen/Qwen3-0.6B --port 8080

# Continuous batching with TurboQuant KV cache compression
pmetal serve \
  --model Qwen/Qwen3-0.6B \
  --continuous-batch \
  --kv-turboquant-preset q3_5 \
  --cb-max-slots 8

# Serve a pre-fused adapter model
pmetal fuse \
  --model Qwen/Qwen3-0.6B \
  --lora ./output/lora_weights.safetensors \
  --output ./output/fused
pmetal serve --model ./output/fused --port 8080

The server exposes the following endpoints, OpenAI-compatible unless noted:

  • POST /v1/chat/completions — Chat completions
  • POST /v1/completions — Text completions
  • POST /v1/embeddings — Embeddings
  • GET /v1/models — List loaded models
  • POST /v1/messages — Anthropic-compatible messages

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Hello"}]}'

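The other endpoints can be exercised the same way. The sketch below is a minimal set of shell helpers, assuming the server accepts the usual OpenAI embeddings body (model, input) and the usual Anthropic messages body (model, max_tokens, messages). The helper names post_json, embeddings_body, and messages_body are made up for illustration, and the page does not say whether the server expects authentication or version headers, so none are sent.

```shell
#!/bin/sh
# Assumed defaults: the address and model used in the examples above.
BASE_URL="${BASE_URL:-http://localhost:8080}"
MODEL="${MODEL:-Qwen/Qwen3-0.6B}"

# post_json PATH BODY: POST a JSON body to the server and print the response.
post_json() {
  curl -sS "$BASE_URL$1" -H "Content-Type: application/json" -d "$2"
}

# embeddings_body TEXT: OpenAI-style embeddings request body.
# TEXT is inserted verbatim, so it must not contain quotes or backslashes.
embeddings_body() {
  printf '{"model": "%s", "input": "%s"}\n' "$MODEL" "$1"
}

# messages_body TEXT MAX_TOKENS: Anthropic-style messages request body.
# TEXT is inserted verbatim, so it must not contain quotes or backslashes.
messages_body() {
  printf '{"model": "%s", "max_tokens": %d, "messages": [{"role": "user", "content": "%s"}]}\n' \
    "$MODEL" "$2" "$1"
}

# Usage, with a server already running:
#   curl -sS "$BASE_URL/v1/models"
#   post_json /v1/embeddings "$(embeddings_body 'Hello')"
#   post_json /v1/messages "$(messages_body 'Hello' 64)"
```

Nothing here contacts the server until one of the usage lines is run, so the file can be sourced safely before the server is up.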
  • Continuous batching: --continuous-batch enables concurrent slot scheduling with token-block admission.
  • Shared prefix cache: repeated chat prefixes are reused across requests.
  • TurboQuant KV cache: --kv-turboquant-preset q3_5 gives near-lossless KV compression; q2_5 trades more quality risk for more memory relief.
  • Standard KV quantization: --kv-quant 8 or --kv-quant 4 can be used instead of TurboQuant.
  • Packed experts: --experts-dir loads SSD-offloaded MoE expert packs.
  • ANE serving: --ane and --ane-real-time are experimental and available only when the ane feature is enabled.
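To get a rough sense of what the --kv-quant settings save, the KV cache for one sequence can be estimated as 2 (K and V) × layers × KV heads × head dimension × sequence length × bits / 8. The sketch below applies that formula. The model shape (28 layers, 8 KV heads, head dimension 128) is an illustrative placeholder rather than a value read from any model config, and the estimate ignores the per-group scale metadata that quantized caches also store, so real usage will be somewhat higher.

```shell
#!/bin/sh
# kv_cache_bytes LAYERS KV_HEADS HEAD_DIM SEQ_LEN BITS
# Estimated KV cache size in bytes for one sequence, excluding quantization metadata.
kv_cache_bytes() {
  # Two tensors (K and V) per layer, one HEAD_DIM vector per KV head per token.
  echo $(( 2 * $1 * $2 * $3 * $4 * $5 / 8 ))
}

# Placeholder model shape at the default --max-seq-len of 4096.
for bits in 16 8 4; do
  bytes=$(kv_cache_bytes 28 8 128 4096 "$bits")
  echo "${bits}-bit KV cache: $(( bytes / 1048576 )) MiB"
done
# prints:
#   16-bit KV cache: 448 MiB
#   8-bit KV cache: 224 MiB
#   4-bit KV cache: 112 MiB
```

With continuous batching, the same estimate applies to each active slot, so the total grows with the number of concurrent sequences.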
| Parameter | Default | Description |
| --- | --- | --- |
| --model | required | Model ID or local path |
| --host | 127.0.0.1 | Bind host |
| --port | 8080 | Bind port |
| --max-seq-len | 4096 | KV cache sequence length |
| --experts-dir | | Packed expert weights directory |
| --fp8 | false | Quantize weights to FP8 E4M3 at load time |
| --kv-quant | | KV cache quantization bits: 8, 4, or 0 |
| --no-kv-quant | false | Force fp16 KV cache |
| --kv-group-size | 64 | KV quantization group size |
| --kv-turboquant | false | Enable TurboQuant KV cache compression |
| --kv-turboquant-preset | | q2_5 or q3_5 |
| --continuous-batch | false | Enable continuous batching |
| --cb-max-slots | 8 | Maximum active decode slots |
| --cb-max-queue-depth | 256 | Maximum pending queue depth |
| --cb-block-size | 32 | Token block size for admission |
| --cb-max-blocks | 0 | Maximum active token blocks; 0 auto-selects |
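When the same options are reused across machines, it can help to assemble the command from environment variables. The following is a minimal sketch of such a wrapper. It uses only flags listed in the table, but the PMETAL_* variable names and the build_serve_cmd function are hypothetical and not something pmetal reads itself.

```shell
#!/bin/sh
# build_serve_cmd: print a pmetal serve command line built from PMETAL_* variables.
# The PMETAL_* names are this sketch's own convention (assumption).
build_serve_cmd() {
  set -- pmetal serve \
    --model "${PMETAL_MODEL:-Qwen/Qwen3-0.6B}" \
    --host "${PMETAL_HOST:-127.0.0.1}" \
    --port "${PMETAL_PORT:-8080}" \
    --max-seq-len "${PMETAL_MAX_SEQ_LEN:-4096}"

  # Add KV cache quantization only when a bit width is given.
  if [ -n "${PMETAL_KV_QUANT:-}" ]; then
    set -- "$@" --kv-quant "$PMETAL_KV_QUANT"
  fi

  # Continuous batching is opt-in, matching the flag's default of false.
  if [ "${PMETAL_CONTINUOUS_BATCH:-0}" = 1 ]; then
    set -- "$@" --continuous-batch --cb-max-slots "${PMETAL_CB_MAX_SLOTS:-8}"
  fi

  # Join the arguments with spaces for display; quoting is not preserved.
  echo "$*"
}

# Dry run: show the command that would be executed.
build_serve_cmd
```

To launch the server instead of printing the command, replace the final echo "$*" with exec "$@", which also keeps arguments containing spaces intact.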