Apple Silicon Support

Hardware detection, per-chip capabilities, M1–M5 support matrix, and ANE integration.

PMetal automatically detects your Apple Silicon hardware at startup and tunes kernel parameters accordingly.

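| Chip | GPU Family | GPU Cores | BW (GB/s) | ANE TFLOPS | NAX | Notes |
|---|---|---|---|---|---|---|
| M1 | Apple7 | 8 | 100 | ~11 | No | |
| M1 Pro | Apple7 | 14 | 200 | ~11 | No | |
| M1 Max | Apple7 | 24 | 400 | ~11 | No | |
| M1 Ultra | Apple7 | 48 | 800 | ~22 | No | 2-die UltraFusion |
| M2 | Apple8 | 8 | 100 | ~15 | No | |
| M2 Pro | Apple8 | 16 | 200 | ~15 | No | |
| M2 Max | Apple8 | 30 | 400 | ~15 | No | |
| M2 Ultra | Apple8 | 48 | 800 | ~30 | No | 2-die UltraFusion |
| M3 | Apple9 | 10 | 120 | 15.8 | No | Dynamic caching |
| M3 Pro | Apple9 | 18 | 273 | 15.8 | No | |
| M3 Max | Apple9 | 30 | 546 | 15.8 | No | |
| M3 Ultra | Apple9 | 60 | 800 | 31.6 | No | 32 NE cores |
| M4 | Apple9 | 10 | 120 | ~12.2 | No | Dynamic caching |
| M4 Pro | Apple9 | 20 | 273 | 12.57 | No | |
| M4 Max | Apple9 | 40 | 546 | 10.93 | No | |
| M4 Ultra | Apple9 | 80 | 800 | ~24 | No | 2-die UltraFusion |
| M5 | Apple10 | 10 | 120 | ~12.2 | Yes | NAX support |
| M5 Pro | Apple10 | 20 | 273 | ~12.3 | Yes | |
| M5 Max | Apple10 | 40 | 546 | TBD | Yes | |
| M5 Ultra | Apple10 | 80 | 800 | TBD | Yes | 2-die UltraFusion |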

ANE TFLOPS are FP16 figures measured with PMetal benchmarks. Values marked ~ are approximate; TBD entries have no figure yet.

PMetal detects at startup:

  • GPU family (Apple7–Apple10) and architecture generation
  • Device tier (Base/Pro/Max/Ultra)
  • GPU core count
  • ANE core count and availability
  • Memory bandwidth (persisted GPU copy benchmark with spec fallback)
  • NAX (M5+, Apple10)
  • Metal features (dynamic caching, mesh shaders)
  • UltraFusion topology (via sysctl hw.packages)
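The detection steps listed above can be sketched roughly as follows. This is a minimal illustration, not PMetal's actual implementation: the names `Tier`, `parse_brand`, `spec_bandwidth_gbps`, `effective_bandwidth_gbps`, `read_sysctl`, `parse_packages`, and `is_ultrafusion` are hypothetical. It assumes the chip name is available as a brand string such as "Apple M4 Max" (for example from `sysctl -n machdep.cpu.brand_string`), that the spec fallback uses the bandwidth column of the table above, and that more than one package in `hw.packages` indicates an UltraFusion part.

```rust
use std::process::Command;

/// Device tier, matching the Base/Pro/Max/Ultra split used in this page.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    Base,
    Pro,
    Max,
    Ultra,
}

/// True for tokens such as "M1" or "M5" (an 'M' followed only by digits).
fn is_m_token(word: &str) -> bool {
    word.len() > 1 && word.starts_with('M') && word[1..].chars().all(|c| c.is_ascii_digit())
}

/// Parse a brand string such as "Apple M4 Max" into (generation, tier).
/// Returns None when no M-series token is present.
pub fn parse_brand(brand: &str) -> Option<(u32, Tier)> {
    let mut words = brand.split_whitespace().skip_while(|w| !is_m_token(w));
    let generation: u32 = words.next()?[1..].parse().ok()?;
    let tier = match words.next() {
        Some("Pro") => Tier::Pro,
        Some("Max") => Tier::Max,
        Some("Ultra") => Tier::Ultra,
        _ => Tier::Base,
    };
    Some((generation, tier))
}

/// Spec bandwidth in GB/s, taken from the support matrix on this page.
pub fn spec_bandwidth_gbps(generation: u32, tier: Tier) -> Option<f64> {
    let newer = generation >= 3; // M3 and later rows use the higher figures
    match (generation, tier) {
        (1..=5, Tier::Base) => Some(if newer { 120.0 } else { 100.0 }),
        (1..=5, Tier::Pro) => Some(if newer { 273.0 } else { 200.0 }),
        (1..=5, Tier::Max) => Some(if newer { 546.0 } else { 400.0 }),
        (1..=5, Tier::Ultra) => Some(800.0),
        _ => None,
    }
}

/// Prefer a measured (persisted) bandwidth; fall back to the spec value.
pub fn effective_bandwidth_gbps(measured: Option<f64>, generation: u32, tier: Tier) -> Option<f64> {
    measured
        .filter(|bw| bw.is_finite() && *bw > 0.0)
        .or_else(|| spec_bandwidth_gbps(generation, tier))
}

/// Run `sysctl -n <name>` and return its trimmed output, if the call succeeds.
pub fn read_sysctl(name: &str) -> Option<String> {
    let out = Command::new("sysctl").arg("-n").arg(name).output().ok()?;
    if !out.status.success() {
        return None;
    }
    Some(String::from_utf8_lossy(&out.stdout).trim().to_string())
}

/// Parse the numeric output of `sysctl -n hw.packages`.
pub fn parse_packages(sysctl_output: &str) -> Option<u32> {
    sysctl_output.trim().parse().ok()
}

/// Assumption: more than one package means a multi-die UltraFusion part.
pub fn is_ultrafusion(packages: u32) -> bool {
    packages > 1
}
```

On a Mac, `read_sysctl("hw.packages").as_deref().and_then(parse_packages)` would feed `is_ultrafusion`. On other platforms `read_sysctl` simply returns `None`, so the parsing helpers stay testable anywhere.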

Use pmetal info to display all detected hardware properties.

On UltraFusion machines, pmetal info now also reports the local same-process executor scaffold: per-die resource slices, the in-memory stage count, and the heuristic 32 TB/s interconnect used by PMetal’s local UltraFusion planner.

Use pmetal bench-corpus to collect a comparable kernel report for the current machine. That corpus exercises standard-Metal hot paths on M1–M4, includes fused-merge plus initial MoE/model-family coverage across Apple7–Apple10, and adds MPP GEMM coverage on Apple10/M5 when NAX is available.

Use pmetal bench-workload when you want a fast end-to-end regression loop instead of only kernel microbenchmarks. The default workload uses the cached Qwen/Qwen3-0.6B model plus TeichAI/gemini-3-pro-preview-high-reasoning-250x, and the inference report now records the KV cache mode chosen on the current machine.

Across Apple7–Apple10 GPUs, PMetal now benchmarks, persists, and reuses standard-Metal Tuna specializations for fused_swiglu, fused_mlp, fused_norm_lora, fused linear cross-entropy, and fused merge. Shared MLX MoE routers now use GPU-native argpartition selection instead of heavier full sorts or CPU top-k paths. M1–M4 remain first-class paths rather than “fallback” hardware.

PMetal also now auto-selects the inference KV cache mode: on Apple7–Apple9 it prefers an fp16 KV cache when the model plus context comfortably fit the device budget, and promotes to q8 only when memory pressure justifies it.
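To make the KV cache rule concrete, here is a minimal sketch of how such a choice could be made. It is not PMetal's code: `KvCacheMode`, `KvShape`, `select_kv_mode`, and the `headroom` fraction are hypothetical, and "comfortably fit" is modeled as "model weights plus fp16 KV cache stay under a fraction of the device budget". The cache size uses the usual estimate of 2 (keys and values) × layers × KV heads × head dimension × context length × bytes per element.

```rust
/// KV cache precision modes named in the text above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KvCacheMode {
    Fp16,
    Q8,
}

/// Hypothetical description of the attention shape that sizes the KV cache.
pub struct KvShape {
    pub layers: u64,
    pub kv_heads: u64,
    pub head_dim: u64,
    pub context_len: u64,
}

impl KvShape {
    /// Bytes needed for keys plus values at the given element width.
    /// Quantization scales and other metadata are ignored in this estimate.
    pub fn bytes(&self, bytes_per_elem: u64) -> u64 {
        2 * self.layers * self.kv_heads * self.head_dim * self.context_len * bytes_per_elem
    }
}

/// Prefer fp16 when weights plus an fp16 cache fit within `headroom`
/// (a fraction in 0.0..=1.0, an assumed tuning knob) of the budget;
/// otherwise promote to q8 to relieve memory pressure.
pub fn select_kv_mode(
    model_bytes: u64,
    shape: &KvShape,
    budget_bytes: u64,
    headroom: f64,
) -> KvCacheMode {
    let fp16_total = model_bytes + shape.bytes(2);
    let comfortable = (budget_bytes as f64 * headroom) as u64;
    if fp16_total <= comfortable {
        KvCacheMode::Fp16
    } else {
        KvCacheMode::Q8
    }
}
```

For an illustrative shape with 28 layers, 8 KV heads, a head dimension of 128, and a 4096-token context, the fp16 cache is 469,762,048 bytes (about 448 MiB), so a small model stays on fp16 on most machines and only tight budgets trigger q8.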

M5 (Apple10, arch gen 17) introduces Neural Accelerator (NAX) units within the GPU cores, providing hardware acceleration for:

  • GEMM (fused matrix multiply)
  • Quantized inference (FP4/FP8)
  • Scaled dot-product attention

These units are accessed through Metal 4.0 kernels (-std=metal4.0). NAX availability is checked via DeviceProperties::has_nax(). PMetal currently ships:

  • The Metal 4 dispatcher
  • NAX-capable hardware detection
  • Persisted Apple10/M5 MPP dispatch tuning across 32×32, 64×32, 32×64, and 64×64 threadgroup variants
  • Persisted runtime selection between MLX fast SDPA, Metal FlashAttention, and MPP FlashAttention for supported no-custom-mask inference shapes with head_dim = 64, 80, 96, or 128, including softcapped configs

Remaining work is to track upstream MLX NAX API changes and keep pmetal-bridge bindings current.
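The tuning and selection described above amount to "enumerate the eligible candidates, time them, keep the fastest". The sketch below illustrates that pattern only; it is not PMetal's dispatcher. `SdpaBackend`, `candidates`, and `pick_fastest` are hypothetical names, and the eligibility rule (both FlashAttention paths require a supported head_dim and no custom mask, and the MPP path additionally requires NAX) is an assumption inferred from the text.

```rust
/// The three attention backends named in the text above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SdpaBackend {
    MlxFastSdpa,
    MetalFlashAttention,
    MppFlashAttention,
}

/// Head dimensions listed as supported for runtime selection.
pub const SUPPORTED_HEAD_DIMS: [u32; 4] = [64, 80, 96, 128];

/// Threadgroup tile variants listed for MPP dispatch tuning.
pub const TILE_VARIANTS: [(u32, u32); 4] = [(32, 32), (64, 32), (32, 64), (64, 64)];

/// Backends worth benchmarking for a given shape (assumed eligibility rule).
pub fn candidates(head_dim: u32, has_custom_mask: bool, has_nax: bool) -> Vec<SdpaBackend> {
    let mut out = vec![SdpaBackend::MlxFastSdpa];
    if SUPPORTED_HEAD_DIMS.contains(&head_dim) && !has_custom_mask {
        out.push(SdpaBackend::MetalFlashAttention);
        if has_nax {
            out.push(SdpaBackend::MppFlashAttention);
        }
    }
    out
}

/// Pick the choice with the lowest finite timing (milliseconds).
/// Works for backends and for tile variants alike.
pub fn pick_fastest<T: Copy>(timings: &[(T, f64)]) -> Option<T> {
    timings
        .iter()
        .filter(|(_, ms)| ms.is_finite())
        .min_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(choice, _)| *choice)
}
```

A caller would time each entry of `TILE_VARIANTS` (or each backend from `candidates`), pass the measurements to `pick_fastest`, and persist the winner so later runs can skip the benchmark.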

PMetal’s ANE pipeline:

  • Dynamic Weight Pipeline: 9 MIL kernels compiled once at startup
  • Hybrid Inference: ANE prefill + CPU decode with KV cache
  • Power-of-2 bucketing: Optimal kernel compilation for sequence lengths
  • CPU RMSNorm: f32 computation on CPU to avoid fp16 ANE overflow
  • IOSurface Zero-Copy: Shared memory surfaces for CPU-ANE transfer
  • Experimental RT Eval: infer and serve accept --ane-real-time, but PMetal falls back to standard ANE if the private real-time path rejects the request on the current OS/framework. As of 2026-03-23, on the local M4 Max, both the tiny-kernel check and the generated SDPA forward probe still hit ANEProgramProcessRequestDirect() ... Program Inference error
  • M1–M5 Compatibility: Per-matrix blobs for M1, single-blob for M3+
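
Two of the items above, power-of-2 bucketing and CPU RMSNorm, are easy to illustrate in isolation. The sketch below is a hedged example rather than PMetal's code: `bucket_len`, `rms_norm_f32`, and the `min_bucket`/`max_bucket` limits are hypothetical. Bucketing rounds a sequence length up to the next power of two so that only a handful of kernels need compiling. The norm runs in f32 because fp16 tops out at 65504, so squaring an activation of just 300 (90000) would already overflow.

```rust
/// Round `seq_len` up to a power-of-two bucket, clamped below by `min_bucket`.
/// Returns None if the bucket would exceed `max_bucket`.
/// Assumes `min_bucket` is itself a power of two.
pub fn bucket_len(seq_len: usize, min_bucket: usize, max_bucket: usize) -> Option<usize> {
    let bucket = seq_len.max(1).checked_next_power_of_two()?.max(min_bucket);
    if bucket <= max_bucket {
        Some(bucket)
    } else {
        None
    }
}

/// RMSNorm over one vector, computed entirely in f32:
/// y_i = x_i / sqrt(mean(x^2) + eps) * weight_i
pub fn rms_norm_f32(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(x.len(), weight.len(), "x and weight must have equal length");
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}
```

With these helpers, a 100-token prompt would be padded to a 128-token bucket, and an input of [300.0, 400.0] normalizes without any intermediate value leaving the f32 range.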