Apple Silicon Support

PMetal automatically detects your Apple Silicon hardware at startup and tunes kernel parameters accordingly.

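| Chip | GPU Family | GPU Cores | BW (GB/s) | ANE TFLOPS | NAX | Notes |
|---|---|---|---|---|---|---|
| M1 | Apple7 | 8 | 100 | ~11 | No | |
| M1 Pro | Apple7 | 14 | 200 | ~11 | No | |
| M1 Max | Apple7 | 24 | 400 | ~11 | No | |
| M1 Ultra | Apple7 | 48 | 800 | ~22 | No | 2-die UltraFusion |
| M2 | Apple8 | 8 | 100 | ~15 | No | |
| M2 Pro | Apple8 | 16 | 200 | ~15 | No | |
| M2 Max | Apple8 | 30 | 400 | ~15 | No | |
| M2 Ultra | Apple8 | 48 | 800 | ~30 | No | 2-die UltraFusion |
| M3 | Apple9 | 10 | 120 | 15.8 | No | Dynamic caching |
| M3 Pro | Apple9 | 18 | 273 | 15.8 | No | |
| M3 Max | Apple9 | 30 | 546 | 15.8 | No | |
| M3 Ultra | Apple9 | 60 | 800 | 31.6 | No | 32 NE cores |
| M4 | Apple9 | 10 | 120 | ~12.2 | No | Dynamic caching |
| M4 Pro | Apple9 | 20 | 273 | 12.57 | No | |
| M4 Max | Apple9 | 40 | 546 | 10.93 | No | |
| M4 Ultra | Apple9 | 80 | 800 | ~24 | No | 2-die UltraFusion |
| M5 | Apple10 | 10 | 120 | ~12.2 | Yes | NAX support |
| M5 Pro | Apple10 | 20 | 273 | ~12.3 | Yes | |
| M5 Max | Apple10 | 40 | 546 | TBD | Yes | |
| M5 Ultra | Apple10 | 80 | 800 | TBD | Yes | 2-die UltraFusion |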

ANE TFLOPS are measured at FP16 using PMetal benchmarks; entries prefixed with ~ are approximate.

PMetal detects at startup:

  • GPU family (Apple7–Apple10) and architecture generation
  • Device tier (Base/Pro/Max/Ultra)
  • GPU core count
  • ANE core count and availability
  • Memory bandwidth (tier + family lookup)
  • NAX (M5+, Apple10)
  • Metal features (dynamic caching, mesh shaders)
  • UltraFusion topology (via sysctl hw.packages)
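
The "tier + family lookup" for memory bandwidth can be pictured as a small table keyed on GPU family and device tier. The sketch below is illustrative only: the type and function names (GpuFamily, DeviceTier, memory_bandwidth_gbps, family_has_nax, parse_hw_packages, read_hw_packages) are hypothetical and not taken from PMetal's source. The bandwidth values are copied from the table above. Treating a package count of 2 or more as a multi-die part is an assumption about how hw.packages is interpreted.

```rust
use std::process::Command;

/// GPU families listed in the table above (hypothetical enum for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum GpuFamily {
    Apple7,
    Apple8,
    Apple9,
    Apple10,
}

/// Device tiers listed in the detection summary (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeviceTier {
    Base,
    Pro,
    Max,
    Ultra,
}

/// Memory bandwidth in GB/s, using the values from the table above.
pub fn memory_bandwidth_gbps(family: GpuFamily, tier: DeviceTier) -> u32 {
    use DeviceTier::*;
    use GpuFamily::*;
    match (family, tier) {
        (Apple7 | Apple8, Base) => 100,
        (Apple7 | Apple8, Pro) => 200,
        (Apple7 | Apple8, Max) => 400,
        (Apple9 | Apple10, Base) => 120,
        (Apple9 | Apple10, Pro) => 273,
        (Apple9 | Apple10, Max) => 546,
        (_, Ultra) => 800,
    }
}

/// NAX is listed for M5+ (Apple10) only.
pub fn family_has_nax(family: GpuFamily) -> bool {
    family >= GpuFamily::Apple10
}

/// Parse the text printed by `sysctl -n hw.packages` into a package count.
pub fn parse_hw_packages(sysctl_output: &str) -> Option<u32> {
    sysctl_output.trim().parse().ok()
}

/// Assumption: two or more packages indicates a multi-die part.
pub fn is_multi_die(packages: u32) -> bool {
    packages >= 2
}

/// Query the package count on macOS; returns None if the call or parse fails.
pub fn read_hw_packages() -> Option<u32> {
    let out = Command::new("sysctl")
        .args(["-n", "hw.packages"])
        .output()
        .ok()?;
    parse_hw_packages(&String::from_utf8_lossy(&out.stdout))
}
```

Keeping the parsing step separate from the sysctl call means the lookup logic can be exercised on any platform, while read_hw_packages simply returns None where the key is unavailable.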

Use pmetal info to display all detected hardware properties.

M5 (Apple10, architecture generation 17) introduces Neural Accelerator (NAX) units within the GPU cores, providing hardware acceleration for:

  • GEMM (fused matrix multiply)
  • Quantized inference (FP4/FP8)
  • Scaled dot-product attention

NAX is accessed through kernels compiled for Metal 4.0 (-std=metal4.0). NAX availability is checked via DeviceProperties::has_nax().
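
A caller would typically branch on the NAX check before choosing which kernel variant to build. The following is a minimal sketch of that gating, not PMetal's actual API: the DeviceProperties struct here is a hypothetical stand-in (only the has_nax() name comes from the text), and GemmPath, select_gemm_path, and metal_std_flag are invented for illustration.

```rust
/// Hypothetical stand-in for PMetal's DeviceProperties.
/// Only the `has_nax()` method name comes from the documentation.
pub struct DeviceProperties {
    /// GPU family number: 7..=10 for Apple7..Apple10.
    gpu_family: u32,
}

impl DeviceProperties {
    pub fn new(gpu_family: u32) -> Self {
        Self { gpu_family }
    }

    /// NAX is listed for M5+ (Apple10).
    pub fn has_nax(&self) -> bool {
        self.gpu_family >= 10
    }
}

/// Which GEMM kernel variant to build (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GemmPath {
    Nax,
    Generic,
}

/// Prefer the NAX kernel when the hardware reports support.
pub fn select_gemm_path(props: &DeviceProperties) -> GemmPath {
    if props.has_nax() {
        GemmPath::Nax
    } else {
        GemmPath::Generic
    }
}

/// Language-standard flag for the chosen path. Only `-std=metal4.0`
/// comes from the text; the generic path leaves the default untouched.
pub fn metal_std_flag(path: GemmPath) -> Option<&'static str> {
    match path {
        GemmPath::Nax => Some("-std=metal4.0"),
        GemmPath::Generic => None,
    }
}
```

With this shape, select_gemm_path(&DeviceProperties::new(9)) falls back to the generic path, so pre-M5 hardware never sees a Metal 4.0 kernel.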

PMetal’s ANE (Apple Neural Engine) pipeline includes:

  • Dynamic Weight Pipeline: 9 MIL kernels compiled once at startup
  • Hybrid Inference: ANE prefill + CPU decode with KV cache
  • Power-of-2 bucketing: Optimal kernel compilation for sequence lengths
  • CPU RMSNorm: f32 computation on CPU to avoid fp16 ANE overflow
  • IOSurface Zero-Copy: Shared memory surfaces for CPU-ANE transfer
  • M1–M5 Compatibility: Per-matrix blobs for M1, single-blob for M3+
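
Two of the items above lend themselves to a short illustration: power-of-2 bucketing and the f32 RMSNorm. The sketch below is written under stated assumptions and is not PMetal's implementation. The function names bucket_for and rmsnorm_f32 are hypothetical, and the RMSNorm formula is the standard one, y_i = x_i * w_i / sqrt(mean(x^2) + eps). The overflow concern is concrete: fp16 tops out at 65504, so squaring an activation of 300 (90000) already exceeds the representable range, while f32 handles it easily.

```rust
/// Round a sequence length up to the next power of two, so that only a
/// small set of kernel shapes ever needs to be compiled.
/// A length of 0 is mapped to the smallest bucket, 1.
pub fn bucket_for(seq_len: usize) -> usize {
    seq_len.max(1).next_power_of_two()
}

/// RMSNorm computed entirely in f32:
///   y_i = x_i * w_i / sqrt(mean(x^2) + eps)
/// The sum of squares is the step that can overflow in fp16.
pub fn rmsnorm_f32(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(x.len(), weight.len(), "input and weight must match in length");
    assert!(!x.is_empty(), "input must not be empty");

    // Mean of squares, accumulated in f32.
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;

    // Reciprocal RMS, computed once and reused for every element.
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();

    x.iter()
        .zip(weight)
        .map(|(v, w)| v * inv_rms * w)
        .collect()
}
```

For example, a prompt of 129 tokens would be padded to the 256 bucket, while one of exactly 128 tokens would use the 128 bucket unchanged.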