PMetal automatically selects kernel parameters based on your device tier and GPU family.

| Tier | Apple7–9 | Apple10 (M5+, NAX) |
|---|---|---|
| Base | 32×32×32 | 64×32×32 |
| Pro | 64×32×32 | 64×64×32 |
| Max | 64×64×32 | 128×64×32 |
| Ultra | 64×64×32 | 128×64×32 |
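
The lookup in the table above can be sketched as a small Python function. This is only an illustration of the selection logic, not PMetal's actual API: the names `Tier`, `TILE_SIZES`, and `select_tile` are hypothetical, and treating any GPU family number of 10 or higher as the "Apple10" column is an assumption.

```python
from enum import Enum


class Tier(Enum):
    BASE = "Base"
    PRO = "Pro"
    MAX = "Max"
    ULTRA = "Ultra"


# Values copied from the table above.
# Each entry is (Apple7-9 sizes, Apple10 sizes), with the three
# dimensions kept in the order the table lists them.
TILE_SIZES = {
    Tier.BASE: ((32, 32, 32), (64, 32, 32)),
    Tier.PRO: ((64, 32, 32), (64, 64, 32)),
    Tier.MAX: ((64, 64, 32), (128, 64, 32)),
    Tier.ULTRA: ((64, 64, 32), (128, 64, 32)),
}


def select_tile(tier, gpu_family):
    """Return the three sizes for a tier and an Apple GPU family number.

    Assumption: families 7 to 9 share one column and 10+ uses the other.
    """
    if gpu_family < 7:
        raise ValueError(f"GPU family Apple{gpu_family} is not in the table")
    older, apple10 = TILE_SIZES[tier]
    return apple10 if gpu_family >= 10 else older


print(select_tile(Tier.PRO, 9))   # Apple7-9 column
print(select_tile(Tier.MAX, 10))  # Apple10 column
```
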

Block size selection per head dimension:

| Head Dim | Base | Pro | Max | Ultra |
|---|---|---|---|---|
| 64 | 64×32 | 64×32 | 64×64 | 64×64 |
| 80 | 64×32 | 64×32 | 64×64 | 64×64 |
| 96 | 64×32 | 64×32 | 64×64 | 64×64 |
| 128 | 32×32 | 32×32 | 64×64 | 64×64 |
| 256 | 32×16 | 32×16 | 32×32 | 32×32 |
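
As a rough sketch, the head-dimension table above amounts to a two-level lookup. The names `BLOCK_SIZES` and `select_block` are hypothetical, and rejecting head dimensions that are not listed is an assumption; the table does not say how PMetal handles other values.

```python
# Values copied from the table above: head_dim -> tier -> block size,
# with the two dimensions kept in the order the table lists them.
BLOCK_SIZES = {
    64: {"Base": (64, 32), "Pro": (64, 32), "Max": (64, 64), "Ultra": (64, 64)},
    80: {"Base": (64, 32), "Pro": (64, 32), "Max": (64, 64), "Ultra": (64, 64)},
    96: {"Base": (64, 32), "Pro": (64, 32), "Max": (64, 64), "Ultra": (64, 64)},
    128: {"Base": (32, 32), "Pro": (32, 32), "Max": (64, 64), "Ultra": (64, 64)},
    256: {"Base": (32, 16), "Pro": (32, 16), "Max": (32, 32), "Ultra": (32, 32)},
}


def select_block(head_dim, tier):
    """Look up the block size for a head dimension and tier name."""
    try:
        return BLOCK_SIZES[head_dim][tier]
    except KeyError:
        # Assumption: unlisted combinations are treated as unsupported here.
        raise ValueError(f"no entry for head_dim={head_dim}, tier={tier!r}")


print(select_block(128, "Pro"))
print(select_block(256, "Ultra"))
```

Note that in this table the blocks shrink as the head dimension grows, and Base/Pro and Max/Ultra always share a column value.
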

| Tier | Threadgroup Size |
|---|---|
| Base | 128 |
| Pro | 128 |
| Max | 256 |
| Ultra | 256 |
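
To show what a threadgroup size means in practice, the following sketch computes how many threadgroups a one-dimensional dispatch would need. It assumes one thread per element, which may not match the kernel this table applies to; `THREADGROUP_SIZE` and `threadgroups_needed` are hypothetical names.

```python
# Values copied from the table above.
THREADGROUP_SIZE = {"Base": 128, "Pro": 128, "Max": 256, "Ultra": 256}


def threadgroups_needed(num_elements, tier):
    """Number of threadgroups for a 1-D dispatch, one thread per element.

    Uses ceiling division so a partially filled final group is counted.
    """
    size = THREADGROUP_SIZE[tier]
    return (num_elements + size - 1) // size


print(threadgroups_needed(1000, "Base"))
print(threadgroups_needed(1000, "Max"))
```
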

| Tier | Threadgroup Size |
|---|---|
| Base | 256 |
| Pro | 256 |
| Max | 512 |
| Ultra | 512 |

| Tier | Multiplier |
|---|---|
| Base | 1× |
| Pro | 2× |
| Max | 4× |
| Ultra | 8× |

| Kernel | Description |
|---|---|
| FlashAttention | O(n) memory attention with fused softmax, tier-aware block sizes |
| Fused GDN | Gated Delta Network recurrence kernel with a single-pass state update |
| Fused LoRA | Combined forward pass for adapter layers (~2× speedup) |
| Fused Cross-Entropy | Unsloth-style chunked loss computation |
| Fused Linear Cross-Entropy | Skips logits materialization entirely |
| Fused RoPE | Rotary position embeddings in-kernel |
| Fused SwiGLU | Fused gate + activation with tier-tuned threadgroups |
| Fused RMSNorm + LoRA | Combined normalization and adapter projection |
| Fused Sampler | JIT-compiled token sampling |
| Fused MLP | Combined gate/up/down projections |
| Async Scheduler | Double/triple-buffered GPU command scheduling |
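
The "O(n) memory attention with fused softmax" entry refers to the general FlashAttention idea: process keys and values block by block while keeping a running softmax, so the full score matrix is never stored. The pure-Python sketch below illustrates that idea for a single query vector. It is not PMetal's Metal kernel; the function name `streaming_attention_row` and the `block_size` parameter are hypothetical.

```python
import math


def streaming_attention_row(q, keys, values, block_size):
    """Attention output for one query, visiting keys/values in blocks.

    Only a running max, a running denominator, and one accumulator
    vector are kept, regardless of how many keys there are.
    """
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])

    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_block]

        # Rescale earlier partial sums to the new maximum for stability.
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)  # 0.0 on the first block
        denom *= correction
        acc = [a * correction for a in acc]

        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * vi for a, vi in zip(acc, v)]
        running_max = new_max

    return [a / denom for a in acc]


q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(streaming_attention_row(q, keys, values, block_size=2))
```

The result should not depend on `block_size` beyond floating-point rounding, which is what lets a kernel pick block sizes per tier without changing the output.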