pmetal serve
Start an HTTP inference server with OpenAI-compatible chat/completions/models/embeddings endpoints and an Anthropic-compatible messages endpoint. Requires the serve feature flag.

    pmetal serve --model <MODEL> [OPTIONS]

Examples

    # Start server
    pmetal serve --model Qwen/Qwen3-0.6B --port 8080

    # Continuous batching with TurboQuant KV cache compression
    pmetal serve \
      --model Qwen/Qwen3-0.6B \
      --continuous-batch \
      --kv-turboquant-preset q3_5 \
      --cb-max-slots 8

    # Serve a pre-fused adapter model
    pmetal fuse \
      --model Qwen/Qwen3-0.6B \
      --lora ./output/lora_weights.safetensors \
      --output ./output/fused
    pmetal serve --model ./output/fused --port 8080

API Compatibility
The server exposes OpenAI-compatible endpoints and one Anthropic-compatible endpoint:

- `POST /v1/chat/completions`: Chat completions
- `POST /v1/completions`: Text completions
- `POST /v1/embeddings`: Embeddings
- `GET /v1/models`: List loaded models
- `POST /v1/messages`: Anthropic-compatible messages

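
As a quick smoke test of the endpoints listed above, the sketch below wraps two of them in shell functions. It assumes the server is running locally on port 8080, as in the examples. `BASE_URL`, `messages_body`, `list_models`, and `send_message` are illustrative names, not part of pmetal. The request body follows the public Anthropic Messages API shape (`model`, `max_tokens`, `messages`); whether pmetal requires or honors every field is an assumption.

```shell
# Assumption: the server was started locally with --port 8080.
BASE_URL="${BASE_URL:-http://localhost:8080}"

# Build an Anthropic-style request body.
# No JSON escaping is done, so keep the prompt free of quotes and backslashes.
messages_body() {
  # $1 = model, $2 = max_tokens, $3 = user prompt
  printf '{"model": "%s", "max_tokens": %d, "messages": [{"role": "user", "content": "%s"}]}' "$1" "$2" "$3"
}

# GET /v1/models: list loaded models.
list_models() {
  curl -s "$BASE_URL/v1/models"
}

# POST /v1/messages: Anthropic-compatible messages endpoint.
send_message() {
  curl -s "$BASE_URL/v1/messages" \
    -H "Content-Type: application/json" \
    -d "$(messages_body "$1" "$2" "$3")"
}

# Print the body that send_message would post (no server needed for this line).
messages_body "Qwen/Qwen3-0.6B" 64 "Hello"
echo
```

With the server up, `list_models` and `send_message "Qwen/Qwen3-0.6B" 64 "Hello"` issue the actual requests.
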
Example chat completion request:

    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Hello"}]}'

Serving Features
- Continuous batching: `--continuous-batch` enables concurrent slot scheduling with token-block admission.
- Shared prefix cache: repeated chat prefixes are reused across requests.
- TurboQuant KV cache: `--kv-turboquant-preset q3_5` gives near-lossless KV compression; `q2_5` trades more quality risk for more memory relief.
- Standard KV quantization: `--kv-quant 8` or `--kv-quant 4` can be used instead of TurboQuant.
- Packed experts: `--experts-dir` loads SSD-offloaded MoE expert packs.
- ANE serving: `--ane` and `--ane-real-time` are experimental options, available when the `ane` feature is enabled.

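
To make the memory trade-off behind the KV cache options concrete, the following back-of-the-envelope estimate computes the raw KV cache size for one sequence at several bit widths. The model dimensions (28 layers, 8 KV heads, head dimension 128) are hypothetical placeholders and `kv_bytes` is an illustrative helper. The formula ignores per-group scale metadata and any TurboQuant-specific layout, so treat the results as rough estimates rather than what pmetal actually allocates.

```shell
# kv_bytes LAYERS KV_HEADS HEAD_DIM SEQ_LEN BITS
# Raw size of keys plus values for one sequence, in bytes.
kv_bytes() {
  echo $(( 2 * $1 * $2 * $3 * $4 * $5 / 8 ))
}

# Hypothetical model shape; replace with the values from your model's config.
LAYERS=28
KV_HEADS=8
HEAD_DIM=128
SEQ_LEN=4096   # matches the --max-seq-len default

for bits in 16 8 4; do
  bytes=$(kv_bytes "$LAYERS" "$KV_HEADS" "$HEAD_DIM" "$SEQ_LEN" "$bits")
  echo "${bits}-bit KV cache: $(( bytes / 1024 / 1024 )) MiB"
done
```

For these placeholder dimensions the loop prints 448, 224, and 112 MiB, which shows why halving the bit width matters most at long sequence lengths or with many concurrent slots.
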
Parameters

| Parameter | Default | Description |
|---|---|---|
| `--model` | required | Model ID or local path |
| `--host` | 127.0.0.1 | Bind host |
| `--port` | 8080 | Bind port |
| `--max-seq-len` | 4096 | KV cache sequence length |
| `--experts-dir` | none | Packed expert weights directory |
| `--fp8` | false | Quantize weights to FP8 E4M3 at load time |
| `--kv-quant` | none | KV cache quantization bits: 8, 4, or 0 |
| `--no-kv-quant` | false | Force fp16 KV cache |
| `--kv-group-size` | 64 | KV quantization group size |
| `--kv-turboquant` | false | Enable TurboQuant KV cache compression |
| `--kv-turboquant-preset` | none | `q2_5` or `q3_5` |
| `--continuous-batch` | false | Enable continuous batching |
| `--cb-max-slots` | 8 | Maximum active decode slots |
| `--cb-max-queue-depth` | 256 | Maximum pending queue depth |
| `--cb-block-size` | 32 | Token block size for admission |
| `--cb-max-blocks` | 0 | Maximum active token blocks; 0 auto-selects |

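
The `--cb-block-size` and `--cb-max-blocks` settings are easier to reason about with a little arithmetic. The sketch below assumes each request occupies whole blocks, so a request of N tokens needs ceil(N / block size) blocks. The documentation only says that admission is token-block based, so this accounting is an assumption, and `blocks_needed` is an illustrative helper.

```shell
# blocks_needed TOKENS BLOCK_SIZE
# Ceiling division: whole blocks required to hold TOKENS tokens.
blocks_needed() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

BLOCK_SIZE=32   # --cb-block-size default

# A 1000-token request rounds up to 32 blocks (1024 token positions).
blocks_needed 1000 "$BLOCK_SIZE"

# Eight slots (the --cb-max-slots default), each filled to the 4096-token
# --max-seq-len default, would occupy this many blocks in total.
echo $(( 8 * $(blocks_needed 4096 "$BLOCK_SIZE") ))
```

If that accounting holds, an explicit `--cb-max-blocks` value below 1024 would limit total in-flight tokens before all eight default slots reach full length.
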
See Also
- `pmetal infer`: Interactive inference
- Feature Flags: Enable the `serve` feature