Release Notes
hexagon: MUL_MAT, MUL_MAT_ID, FLASH_ATTN and GDN cleanup and optimizations for latest models (#23989)
hex-mm: initial support for F32 * F32 -> F32 matmuls
hex-rms-norm: fix src1 stride use in fused rms_norm_mul
hex-ops: clear spad pointers in the ops that clobber them
This fixes an odd case where fused rms-norm-mul was failing, but only in qwen3.5-2B and only at certain op-batch sizes.
hmx-mm: add support for F32 * F32 -> F32 matmul_2d on HMX
Decided to reuse the Q4_0 * F32 -> F32 matmul path for this. Q4_0 gets dequantized and tiled into F16, and here we quantize and tile F32 into F16. Super simple and pretty efficient.
hmx-mm: route f16 2D matmuls through the same kernel used for all other types
hmx-mm: re-introduce the pipelined vs non-pipelined mode that we used to have, but in a much more generic way
This update further improves matmul performance and at the same time removes most of the redundant logic we had in different paths.
hmx-fa: slightly improved pipeline, similar to the matmul updates
hmx-mm: initial version of MUL_MAT_ID support for HMX
hmx-mm: fixed mxfp4 handling for MUL_MAT_ID
hex-gdn: optimize GATED_DELTA_NET
DMA prefetch/double-buff, vectorize everything with HVX, in other words -- the usual :)
hmx-mm: missed one more case where we can use fastmod
hexagon: update DCVS settings for a slight perf bump
hmx-fa: use fastdiv in hmx-flash-attn
hmx-fa: precompute slope values to avoid disrupting the inner loop
hvx-utils/fa: new HVX helpers for powf and logf, used to speed up FA alibi
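
For readers unfamiliar with the ALiBi slopes mentioned in the two entries above, here is a minimal sketch of precomputing one slope per attention head before the inner loop runs. It follows the commonly used ALiBi slope scheme; whether the Hexagon kernel uses exactly this formula is an assumption, and `alibi_slopes`, `n_head` and `max_bias` are illustrative names, not identifiers from this change.

```python
def alibi_slopes(n_head, max_bias):
    """Return one ALiBi slope per head, computed once up front (illustrative)."""
    # Largest power of two that is <= n_head.
    n_head_log2 = 1 << (n_head.bit_length() - 1)
    m0 = 2.0 ** (-max_bias / n_head_log2)
    m1 = 2.0 ** (-(max_bias / 2.0) / n_head_log2)
    slopes = []
    for h in range(n_head):
        if h < n_head_log2:
            slopes.append(m0 ** (h + 1))
        else:
            # Heads beyond the power-of-two boundary interleave with the first set.
            slopes.append(m1 ** (2 * (h - n_head_log2) + 1))
    return slopes


# The table is built once; the hot loop then only does a lookup per head.
slopes = alibi_slopes(n_head=8, max_bias=8.0)
```

Building the table once keeps the power computations out of the per-row attention loop, which is the motivation the entry above gives.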
hex-ops: fixed a bug in fusion logic that was messing up the order of the src tensors when some srcs are empty
hex-fa: correctly fall back to HVX if we have sinks or the dims are not quite right
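
The fastdiv and fastmod entries above refer to replacing integer division by a fixed divisor with a multiply and a shift. The sketch below shows the general technique (a precomputed magic multiplier for 32-bit unsigned values) in Python. The function names and the tuple layout are hypothetical, and the actual Hexagon helpers may differ in detail.

```python
def init_fastdiv(d):
    """Precompute (mp, shift, d) for a fixed divisor 1 <= d < 2**32."""
    assert 1 <= d < (1 << 32)
    shift = 0
    while (1 << shift) < d:          # shift = ceil(log2(d))
        shift += 1
    mp = ((1 << 32) * ((1 << shift) - d)) // d + 1
    return mp, shift, d


def fastdiv(n, fd):
    """Compute n // d for 0 <= n < 2**32 using a multiply-high and a shift."""
    mp, shift, _ = fd
    hi = (n * mp) >> 32              # high 32 bits of the 64-bit product
    return (hi + n) >> shift


def fastmod(n, fd):
    """Compute n % d from the fast quotient."""
    return n - fastdiv(n, fd) * fd[2]


# Example: split a flat element index into (row, col) for a row width of 96.
fd = init_fastdiv(96)
row, col = fastdiv(1000, fd), fastmod(1000, fd)   # row 10, col 40
```

The divisor-dependent values are computed once per op, so the inner loop avoids a hardware divide for every index calculation. Note that in fixed-width C code the `hi + n` sum can exceed 32 bits, so a real implementation has to account for that.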
macOS/iOS:
- macOS Apple Silicon (arm64)
- macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED
- macOS Intel (x64)
- iOS XCFramework
Linux:
- Ubuntu x64 (CPU)
- Ubuntu arm64 (CPU)
- Ubuntu s390x (CPU)
- Ubuntu x64 (Vulkan)
- Ubuntu arm64 (Vulkan)
- Ubuntu x64 (ROCm 7.2)
- Ubuntu x64 (OpenVINO)
- Ubuntu x64 (SYCL FP32) DISABLED
Android:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.3 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL) DISABLED
- Windows x64 (HIP)
openEuler:
- DISABLED
- openEuler x86 (310p)
- openEuler x86 (910b, ACL Graph)
- openEuler aarch64 (310p)
- openEuler aarch64 (910b, ACL Graph)
UI: