Kernel Tuning
Per-tier Metal kernel tuning — matrix tiles, FlashAttention blocks, threadgroup sizes, chunk sizes, and batch multipliers.
PMetal automatically selects kernel parameters based on your device tier and GPU family. For several hot paths, Tuna now persists the resolved specialization per device/problem shape and compiles the Metal shader with matching function constants rather than relying only on host-side heuristics.
Matrix Tile Size (GEMM, LoRA forward)
| Tier | Apple7–9 | Apple10 (M5+, NAX) |
|---|---|---|
| Base | 32×32×32 | 64×32×32 |
| Pro | 64×32×32 | 64×64×32 |
| Max | 64×64×32 | 128×64×32 |
| Ultra | 64×64×32 | 128×64×32 |
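As an illustration, the table can be read as a two-key lookup on tier and GPU family. The sketch below is hypothetical: `TILE_SEEDS` and `tile_seed` are names invented here, not PMetal's API, and it assumes the three numbers in each cell are an M×N×K tile, which the table does not state explicitly.

```python
# Hypothetical lookup mirroring the tier table; not PMetal's actual API.
# Assumption: each cell is an (M, N, K) tile.
TILE_SEEDS = {
    # tier: (Apple7-9 tile, Apple10 tile)
    "base": ((32, 32, 32), (64, 32, 32)),
    "pro": ((64, 32, 32), (64, 64, 32)),
    "max": ((64, 64, 32), (128, 64, 32)),
    "ultra": ((64, 64, 32), (128, 64, 32)),
}


def tile_seed(tier: str, gpu_family: int) -> tuple:
    """Return the tile for a device tier and Apple GPU family number (7 and up)."""
    if gpu_family < 7:
        raise ValueError("the table only covers Apple7 and newer")
    legacy, apple10 = TILE_SEEDS[tier.lower()]
    # Apple10 (M5+) gets the larger tiles; Apple7-9 share one column.
    return apple10 if gpu_family >= 10 else legacy
```

For example, `tile_seed("Pro", 9)` reads the Apple7–9 column and `tile_seed("Max", 10)` reads the Apple10 column.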
The tier table applies to the standard Metal GEMM/LoRA kernels. On Apple10/M5 hardware, the Metal 4 / MPP dispatcher now auto-tunes among 32×32 / 1-simdgroup, 64×32 / 2-simdgroup, 32×64 / 2-simdgroup, and 64×64 / 4-simdgroup variants, plus Morton-vs-linear tile walk order, and persists the result. Aligned M/N tiles use static full-tile extents. Apple7–9 continue to use the standard Metal kernels.

For 4-bit affine quantized linear inference, Apple10/M5 also benchmarks MLX quantized_matmul against the MPP path and persists the choice per device and problem shape.

Standard Metal FlashAttention now benchmarks the known-valid per-head-dimension block pairs and persists the result instead of relying only on the tier table below. For supported no-custom-mask inference attention shapes with head_dim = 64, 80, 96, or 128, including softcapped configs, Apple10/M5 also benchmarks MLX fast SDPA, Metal FlashAttention, and MPP FlashAttention against each other and persists the result, rejecting MPP candidates that diverge numerically from the MLX reference.
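The benchmark-and-reject step described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Tuna's implementation: `pick_fastest`, the tolerance, and the repeat count are hypothetical, and plain Python callables stand in for the real kernels.

```python
import time
from typing import Callable, Dict, Optional, Sequence


def pick_fastest(
    candidates: Dict[str, Callable[..., Sequence[float]]],
    reference: Callable[..., Sequence[float]],
    args: tuple,
    tol: float = 1e-3,   # assumed tolerance, not a documented value
    repeats: int = 3,    # assumed repeat count
) -> Optional[str]:
    """Return the name of the fastest candidate whose output matches the reference."""
    expected = reference(*args)
    best_name, best_time = None, float("inf")
    for name, fn in candidates.items():
        out = fn(*args)
        # Reject candidates that diverge numerically from the reference output.
        if len(out) != len(expected) or any(
            abs(a - b) > tol for a, b in zip(out, expected)
        ):
            continue
        start = time.perf_counter()
        for _ in range(repeats):
            fn(*args)
        elapsed = (time.perf_counter() - start) / repeats
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name
```

The returned name is what a tuner would then persist for that device and problem shape. If every candidate diverges, the function returns `None` and the caller would fall back to the reference path.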
FlashAttention Block Sizes
Baseline block-size seed per head dimension before persisted tuning:
| Head Dim | Base | Pro | Max | Ultra |
|---|---|---|---|---|
| 64 | 64×32 | 64×64 | 64×64 | 64×64 |
| 80 | 32×32 | 64×32 | 64×32 | 64×32 |
| 96 | 32×32 | 64×32 | 64×32 | 64×32 |
| 128 | 32×32 | 32×32 | 64×32 | 64×32 |
| 256 | 16×16 | 16×16 | 32×16 | 32×16 |
Fused Kernel Threadgroup Sizes
Fused RMSNorm + LoRA
Tuna now benchmarks and persists THREADS_PER_TOKEN plus the tiled/non-tiled path choice for this kernel. The table below is the heuristic seed used to order the benchmark candidates.
| Tier | Threads / Token | Tiled Path |
|---|---|---|
| Base | 128 | out_features > 256 |
| Pro | 256 | out_features > 256 |
| Max | 512 | out_features > 256 |
| Ultra | 512 | out_features > 256 |
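One way to picture "seed used to order the benchmark candidates" is a list sorted so that the seed configuration is tried first. The sketch below is an assumption-laden illustration: the candidate set `(128, 256, 512)`, the nearest-to-seed ordering, and the function names are all hypothetical. Only the seed values and the `out_features > 256` rule come from the table.

```python
# Seed values from the table; everything else here is illustrative.
RMSNORM_LORA_SEED = {"base": 128, "pro": 256, "max": 512, "ultra": 512}


def order_candidates(candidates, seed):
    """Sort candidates so the seed comes first, then the values nearest to it."""
    return sorted(candidates, key=lambda c: (abs(c - seed), c))


def rmsnorm_lora_plan(tier, out_features, thread_options=(128, 256, 512)):
    """Return (threads_per_token, use_tiled_path) pairs in benchmark order."""
    seed = RMSNORM_LORA_SEED[tier.lower()]
    tiled_first = out_features > 256  # heuristic from the table
    threads = order_candidates(thread_options, seed)
    # Try the heuristic path choice before its alternative for each thread count.
    return [(t, tiled) for t in threads for tiled in (tiled_first, not tiled_first)]
```

The first entry of the returned list is the table's seed configuration. The remaining entries are the alternatives a tuner could still benchmark before persisting a winner.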
Fused SwiGLU / Fused MLP
Tuna now benchmarks and persists THREADS_PER_TOKEN and SWIGLU_CHUNK_SIZE for these standard-Metal kernels. The table below is the heuristic seed used to order the benchmark candidates.
| Tier | Threads / Token | Chunk Size |
|---|---|---|
| Base | 128 or 256 | 1024 or 2048 |
| Pro | 256 | 2048 or 4096 |
| Max | 512 | 2048 or 4096 |
| Ultra | 512 | 2048 or 4096 |
Fused Linear Cross-Entropy
Tuna now benchmarks and persists fused linear cross-entropy threadgroup size and default chunk size per device, dtype, and problem shape. The table below is the heuristic seed used to order the benchmark candidates.
| Tier | Threads / Token | Chunk Size |
|---|---|---|
| Base | 128 / 256 / 512 | 1024 or 2048 |
| Pro | 256 / 512 / 1024 | 2048 or 4096 |
| Max | 256 / 512 / 1024 | 4096 or 8192 |
| Ultra | 256 / 512 / 1024 | 4096 or 8192 |
CE_THREADS_PER_TOKEN still scales with vocabulary size and clamps to the hardware maximum. Base-tier devices seed at 128 / 256 / 512 across < 32k, 32k..128k, and > 128k vocabularies, while Pro / Max / Ultra seed at 256 / 512 / 1024. Wider hidden states start from a smaller chunk-size seed before benchmarking.
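The vocabulary-based seeding can be written as a small function. This is a sketch, not the actual selection code: `ce_threads_seed` and `hw_max_threads` are hypothetical names, and it assumes "k" means 1,000 because the text does not say whether the boundaries are 32,000 / 128,000 or 32,768 / 131,072.

```python
def ce_threads_seed(tier: str, vocab_size: int, hw_max_threads: int = 1024) -> int:
    """Seed CE_THREADS_PER_TOKEN from tier and vocabulary size, clamped to hardware."""
    # Base seeds at 128 / 256 / 512; Pro, Max, and Ultra seed at 256 / 512 / 1024.
    if tier.lower() == "base":
        small, medium, large = 128, 256, 512
    else:
        small, medium, large = 256, 512, 1024

    # Assumption: "32k" and "128k" are taken as 32,000 and 128,000.
    if vocab_size < 32_000:
        seed = small
    elif vocab_size <= 128_000:
        seed = medium
    else:
        seed = large

    # Clamp to the maximum threadgroup size the device reports.
    return min(seed, hw_max_threads)
```

Under these assumptions, a Base-tier device with a 50,000-token vocabulary seeds at 256 threads, and a Max-tier device with a 200,000-token vocabulary seeds at 1024 unless the hardware limit is lower.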
Fused Merge
Tuna now benchmarks and persists threads_per_group and elements_per_thread for the standard-Metal fused-merge elementwise kernels. This path stays first-class on Apple7-Apple9 and is also reused on Apple10 for non-MPP merge workloads.
| Tier | Threads / Group | Elements / Thread |
|---|---|---|
| Base | 128 or 256 | 2, 4, or 8 |
| Pro | 128, 256, or 512 | 4 or 8 |
| Max | 256 or 512 | 4 or 8 |
| Ultra | 256 or 512 | 4 or 8 |
The candidate ordering is tier-aware, but the final result is benchmarked per device and problem shape and stored in merge.json. Persistent merge and LoRA-forward cache keys now include device identity, so a persisted cache stays safe when it is shared across different Apple Silicon tiers and bins: each device only looks up its own entries.
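To show why a device-qualified key keeps a persisted cache safe, here is a minimal sketch of a JSON-backed store. The key layout, the JSON schema, and all function names are assumptions made for illustration. The real format of merge.json is not documented here.

```python
import json
from pathlib import Path


def merge_cache_key(device_id: str, dtype: str, num_elements: int) -> str:
    """Build a cache key that includes device identity (hypothetical layout)."""
    return f"{device_id}|{dtype}|{num_elements}"


def store_result(path: Path, key: str, threads_per_group: int, elements_per_thread: int) -> None:
    """Persist one tuned configuration under its key."""
    cache = json.loads(path.read_text()) if path.exists() else {}
    cache[key] = {
        "threads_per_group": threads_per_group,
        "elements_per_thread": elements_per_thread,
    }
    path.write_text(json.dumps(cache, indent=2, sort_keys=True))


def load_result(path: Path, key: str):
    """Return the stored configuration for this key, or None if it was never tuned."""
    if not path.exists():
        return None
    return json.loads(path.read_text()).get(key)
```

Because the device identifier is part of the key, a lookup from a different device misses and triggers a fresh benchmark instead of reusing a configuration tuned for other hardware.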
Batch Size Multiplier
| Tier | Multiplier |
|---|---|
| Base | 1× |
| Pro | 2× |
| Max | 4× |
| Ultra | 8× |
Metal GPU Optimizations
| Kernel | Description |
|---|---|
| FlashAttention | O(n) memory attention with fused softmax, tier-aware block sizes |
| Fused GDN | Gated Delta Network recurrence kernel — single-pass state update |
| Fused LoRA | Combined forward pass for adapter layers (~2× speedup) |
| Fused Cross-Entropy | Chunked vocabulary loss computation |
| Fused Linear Cross-Entropy | Skips logits materialization entirely, with tuned chunk/thread specialization |
| Fused RoPE | Rotary position embeddings in-kernel |
| Fused SwiGLU | Fused gate + activation with benchmarked-and-persisted Tuna thread/chunk specialization |
| Fused RMSNorm + LoRA | Combined normalization and adapter projection with benchmarked-and-persisted Tuna thread/tiled specialization |
| Fused Sampler | JIT-compiled token sampling |
| Fused MLP | Combined gate/up/down projections |
| Async Scheduler | Double/triple-buffered GPU command scheduling |
See Also
- Apple Silicon Support — Hardware detection matrix
- pmetal bench — Benchmark on your hardware