b9866
Published: Jul 3, 2026

Release Notes

cuda: enable topk-moe fusion for 288 experts (#25267)

  • cuda: enable topk-moe fusion for 288 experts

The topk-moe fusion only accepted power-of-2 expert counts (or the special-cased 576), so models with 288 experts (e.g. Step-3.7-Flash) fell back to the unfused per-layer routing chain: softmax/sigmoid, argsort, get_rows, sum_rows, div, clamp, scale. At batch size 1 that is ~330 extra tiny graph nodes per token.

288 is a multiple of the warp size, so the existing kernel already handles it; this adds the missing template instantiation and accepts 288 in the eligibility check.
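The eligibility rule described above can be sketched roughly as follows. This is an illustration only, not the code from the PR: the names `WARP_SIZE`, `is_power_of_two`, and `topk_moe_fusion_eligible` are hypothetical, the warp size of 32 is an assumption, and any upper bound the real check places on the expert count is omitted.

```cpp
// Sketch of the eligibility rule from the notes above (hypothetical names,
// not the actual llama.cpp implementation).

// Assumed warp width; the notes only say 288 is a multiple of the warp size.
constexpr int WARP_SIZE = 32;

constexpr bool is_power_of_two(int n) {
    return n > 0 && (n & (n - 1)) == 0;
}

// Expert counts assumed to have a matching kernel template instantiation:
// powers of two, the special-cased 576, and the newly accepted 288.
constexpr bool topk_moe_fusion_eligible(int n_experts) {
    return is_power_of_two(n_experts) || n_experts == 288 || n_experts == 576;
}

// 288 = 9 * 32, so it splits evenly across the lanes of a 32-wide warp.
static_assert(288 % WARP_SIZE == 0, "288 is a multiple of the assumed warp size");
static_assert(topk_moe_fusion_eligible(288), "288 is accepted after this change");
static_assert(topk_moe_fusion_eligible(256), "powers of two were already accepted");
static_assert(!topk_moe_fusion_eligible(384), "other counts still use the unfused path");
```

Under this sketch, a count such as 384 is also a multiple of 32 but would still fall back to the unfused chain, because eligibility depends on an instantiation existing, not on divisibility alone.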

Measured on gfx1151 with Step-3.7-Flash IQ4_XS (llama-bench, -b 4096 -ub 4096 -fa 1 -dio 1 -ctk q8_0 -ctv q8_0; machine idle, before/after paired so pp4096 stays matched as a load control):

test            | before         | after
----------------+----------------+---------------------------
pp4096          | 460.99 ± 0.45  | 462.47 ± 0.34 (unchanged)
tg128           |  19.10 ± 0.04  |  19.56 ± 0.03 (+2.4%)
tg128 @ d30000  |  12.68 ± 0.04  |  12.69 ± 0.03 (unchanged)

Prompt processing is unaffected (the fusion only touches decode routing). The decode gain is about 2.4% at shallow context and fades with depth: by 30k tokens each step is attention-bound over the KV cache, so removing the fixed routing overhead is no longer visible.
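As a quick check of the percentages quoted above, the relative change can be recomputed from the mean values in the table. The helper name `pct_change` is hypothetical, and the ± spreads are ignored in this sketch.

```cpp
// Relative change in percent between two throughput means
// (hypothetical helper, not part of llama-bench).
constexpr double pct_change(double before, double after) {
    return (after - before) / before * 100.0;
}

// Means taken from the benchmark table above.
constexpr double pp4096_delta = pct_change(460.99, 462.47);  // roughly +0.3%
constexpr double tg128_delta  = pct_change(19.10, 19.56);    // roughly +2.4%
constexpr double tg128_d30000 = pct_change(12.68, 12.69);    // roughly +0.1%

static_assert(tg128_delta > 2.4 && tg128_delta < 2.5, "matches the quoted +2.4%");
static_assert(pp4096_delta < 0.5, "prompt processing is effectively unchanged");
static_assert(tg128_d30000 < 0.1, "no visible gain at 30k depth");
```

Only the tg128 change is clearly larger than the reported run-to-run spread, which is consistent with the "unchanged" labels on the other two rows.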

Assisted-By: Claude Fable 5 [email protected]

  • Update tests/test-backend-ops.cpp

Co-authored-by: Oliver Simons [email protected]

  • Add comment for case 288 in topk-moe.cu

Co-authored-by: Oliver Simons [email protected]

openEuler:

  • DISABLED
  • openEuler x86 (310p)
  • openEuler x86 (910b, ACL Graph)
  • openEuler aarch64 (310p)
  • openEuler aarch64 (910b, ACL Graph)