pmetal serve

Start an HTTP inference server with an OpenAI-compatible API. Requires the `serve` feature flag.

```sh
pmetal serve --model <MODEL> [OPTIONS]
```
```sh
# Start server
pmetal serve --model Qwen/Qwen3-0.6B --port 8080

# With LoRA adapter
pmetal serve --model Qwen/Qwen3-0.6B --lora ./output/lora_weights.safetensors --port 8080
```

The server exposes OpenAI-compatible endpoints:

  • POST /v1/chat/completions — Chat completions
  • POST /v1/completions — Text completions
  • GET /v1/models — List loaded models
```sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Hello"}]}'
```
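Because the API is OpenAI-compatible, the same request can be issued from any language. A minimal Python sketch using only the standard library, assuming the default host and port from the examples above (the helper names here are illustrative, not part of pmetal):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumes the server started with --port 8080


def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request mirroring the curl example above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(model, prompt)) as resp:
        body = json.load(resp)
    # Standard OpenAI-style response shape: choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

Clients built for the OpenAI API (such as the official SDKs) should also work by pointing their base URL at `http://localhost:8080/v1`.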