Kernel Tuning

PMetal automatically selects kernel parameters based on your device tier and GPU family.

| Tier | Apple7–9 | Apple10 (M5+, NAX) |
| --- | --- | --- |
| Base | 32×32×32 | 64×32×32 |
| Pro | 64×32×32 | 64×64×32 |
| Max | 64×64×32 | 128×64×32 |
| Ultra | 64×64×32 | 128×64×32 |
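
The selection in the table above is a lookup keyed by tier and GPU family. The following is a minimal Python sketch of that lookup, not PMetal's actual API: the names `TILE_TABLE`, `gpu_family_column`, and `select_tile` are hypothetical, and the three numbers are kept as an opaque tuple because the table does not name its dimensions.

```python
# Illustrative lookup only; all names here are hypothetical, not PMetal's API.
# Values are copied from the tier / GPU-family table.
TILE_TABLE = {
    "Base":  {"apple7-9": (32, 32, 32), "apple10": (64, 32, 32)},
    "Pro":   {"apple7-9": (64, 32, 32), "apple10": (64, 64, 32)},
    "Max":   {"apple7-9": (64, 64, 32), "apple10": (128, 64, 32)},
    "Ultra": {"apple7-9": (64, 64, 32), "apple10": (128, 64, 32)},
}


def gpu_family_column(family: int) -> str:
    """Map an Apple GPU family number to a column of the table."""
    if 7 <= family <= 9:
        return "apple7-9"
    if family == 10:
        return "apple10"
    raise ValueError(f"GPU family Apple{family} is not covered by the table")


def select_tile(tier: str, family: int) -> tuple:
    """Return the three-number kernel parameter tuple for a tier and family."""
    if tier not in TILE_TABLE:
        raise ValueError(f"unknown tier: {tier!r}")
    return TILE_TABLE[tier][gpu_family_column(family)]


print(select_tile("Pro", 8))    # (64, 32, 32)
print(select_tile("Max", 10))   # (128, 64, 32)
```

Raising on an unknown tier or family, rather than falling back silently, makes a misdetected device visible instead of quietly running with mismatched parameters.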

Block size selection per head dimension:

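| Head Dim | Base | Pro | Max | Ultra |
| --- | --- | --- | --- | --- |
| 64 | 64×32 | 64×32 | 64×64 | 64×64 |
| 80 | 64×32 | 64×32 | 64×64 | 64×64 |
| 96 | 64×32 | 64×32 | 64×64 | 64×64 |
| 128 | 32×32 | 32×32 | 64×64 | 64×64 |
| 256 | 32×16 | 32×16 | 32×32 | 32×32 |
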
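
The per-head-dimension choice can be sketched the same way. This is an illustrative Python lookup with hypothetical names (`BLOCK_SIZES`, `select_block_size`). It assumes only the head dimensions listed in the table are supported and makes no claim about how PMetal handles other values.

```python
# Hypothetical sketch; values are copied from the head-dimension table.
# In the table, Base and Pro share one column of values, as do Max and Ultra.
_SMALL = {64: (64, 32), 80: (64, 32), 96: (64, 32), 128: (32, 32), 256: (32, 16)}
_LARGE = {64: (64, 64), 80: (64, 64), 96: (64, 64), 128: (64, 64), 256: (32, 32)}

BLOCK_SIZES = {"Base": _SMALL, "Pro": _SMALL, "Max": _LARGE, "Ultra": _LARGE}


def select_block_size(head_dim: int, tier: str) -> tuple:
    """Return the block-size pair listed for this head dimension and tier."""
    if tier not in BLOCK_SIZES:
        raise ValueError(f"unknown tier: {tier!r}")
    per_dim = BLOCK_SIZES[tier]
    if head_dim not in per_dim:
        # Assumption: unlisted head dimensions are rejected rather than rounded.
        raise ValueError(f"head_dim {head_dim} is not listed in the table")
    return per_dim[head_dim]


print(select_block_size(128, "Pro"))    # (32, 32)
print(select_block_size(256, "Ultra"))  # (32, 32)
```

The table's trend is visible in the data: blocks shrink as the head dimension grows, and the Max and Ultra tiers keep larger blocks than Base and Pro at head dimensions 128 and 256.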

| Tier | Threadgroup Size |
| --- | --- |
| Base | 128 |
| Pro | 128 |
| Max | 256 |
| Ultra | 256 |


| Tier | Threadgroup Size |
| --- | --- |
| Base | 256 |
| Pro | 256 |
| Max | 512 |
| Ultra | 512 |


| Tier | Multiplier |
| --- | --- |
| Base | |
| Pro | |
| Max | |
| Ultra | |


| Kernel | Description |
| --- | --- |
| FlashAttention | O(n) memory attention with fused softmax, tier-aware block sizes |
| Fused GDN | Gated Delta Network recurrence kernel, single-pass state update |
| Fused LoRA | Combined forward pass for adapter layers (~2× speedup) |
| Fused Cross-Entropy | Unsloth-style chunked loss computation |
| Fused Linear Cross-Entropy | Skips logits materialization entirely |
| Fused RoPE | Rotary position embeddings in-kernel |
| Fused SwiGLU | Fused gate + activation with tier-tuned threadgroups |
| Fused RMSNorm + LoRA | Combined normalization and adapter projection |
| Fused Sampler | JIT-compiled token sampling |
| Fused MLP | Combined gate/up/down projections |
| Async Scheduler | Double/triple-buffered GPU command scheduling |
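
To illustrate what "O(n) memory attention with fused softmax" means in the FlashAttention entry, here is a CPU-only Python sketch of the general online-softmax technique for a single query vector. It is not PMetal's Metal kernel. The function names and the `block_size` parameter are made up for this example, and the real kernel's blocking follows the tier tables rather than this toy value.

```python
import math


def attention_reference(q, keys, values):
    """Naive attention for one query: holds every score in memory at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def attention_blockwise(q, keys, values, block_size=2):
    """Same result, visiting keys and values one block at a time.

    Only a running max, a running normalizer, and a running weighted sum
    are kept, so the extra state does not grow with the number of keys.
    """
    if not keys:
        raise ValueError("keys must be non-empty")
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results so they are relative to new_max.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.5, -0.3], [0.0, 2.0], [-1.0, 0.4], [0.7, 0.7]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]

ref = attention_reference(q, keys, values)
blk = attention_blockwise(q, keys, values, block_size=2)
print(all(abs(a - b) < 1e-9 for a, b in zip(ref, blk)))
```

The rescaling step is what lets softmax be fused into the attention loop: each block updates the running maximum and normalizer in place, so the full score row is never stored.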