DeepSeek-V2-style aux-loss-free routing computes sigmoid(gates) once but
needs it twice: the raw sigmoid output is gathered after top-k, while the
post-bias negation is the argpartition key. Fuse into a single multi-output
Compiled kernel returning both, saving two launches on the routing path
per token. Exposed as a general SigmoidRouter since the same pattern is
shared across DeepSeek-V2 descendants.
Improves glm4.7 generation performance by approximately 1%.
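The two uses of the sigmoid output can be sketched in plain Go. This is a scalar reference under stated assumptions, not the fused MLX kernel: `sigmoidRoute` and its parameters are hypothetical names, and a stable sort stands in for argpartition.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// sigmoidRoute is a hypothetical scalar reference for the routing step.
// It returns the indices of the k selected experts and their raw sigmoid
// scores. The bias influences which experts are selected, but not the
// weights gathered afterwards.
func sigmoidRoute(gates, bias []float64, k int) ([]int, []float64) {
	scores := make([]float64, len(gates)) // raw sigmoid, gathered after top-k
	keys := make([]float64, len(gates))   // negated biased score, the selection key
	for i, g := range gates {
		scores[i] = 1 / (1 + math.Exp(-g))
		keys[i] = -(scores[i] + bias[i])
	}

	// Stand-in for argpartition: order expert indices by ascending key,
	// so the largest biased scores come first.
	idx := make([]int, len(gates))
	for i := range idx {
		idx[i] = i
	}
	sort.SliceStable(idx, func(a, b int) bool { return keys[idx[a]] < keys[idx[b]] })
	idx = idx[:k]

	weights := make([]float64, k)
	for i, e := range idx {
		weights[i] = scores[e]
	}
	return idx, weights
}

func main() {
	idx, weights := sigmoidRoute([]float64{0, 2, -1, 1}, []float64{0.5, 0, 0, 0}, 2)
	fmt.Println(idx, weights)
}
```

In the example call, expert 0 is selected only because of its bias, yet its returned weight is the unbiased sigmoid value.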
Converts SiLU/GELUApprox to compiled kernels and adds SwiGLU,
matching upstream mlx/mlx_lm's activations pattern. Routes llama,
qwen3, qwen3_5 (dense + MoE), and glm4_moe_lite MLP paths through
mlx.SwiGLU so each MLP invocation runs as one fused Metal/CUDA
kernel rather than a chain of per-op launches.
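For reference, the math that a SwiGLU op fuses can be written as a scalar sketch in plain Go. The function name and the (gate, up) argument order are assumptions made for illustration and may not match the signature of mlx.SwiGLU.

```go
package main

import (
	"fmt"
	"math"
)

// silu computes x * sigmoid(x).
func silu(x float64) float64 {
	return x / (1 + math.Exp(-x))
}

// swiglu is a hypothetical elementwise reference: silu(gate) * up.
// A fused kernel would compute the same value per element in one launch
// instead of separate sigmoid, multiply, and multiply ops.
func swiglu(gate, up []float64) []float64 {
	out := make([]float64, len(gate))
	for i := range gate {
		out[i] = silu(gate[i]) * up[i]
	}
	return out
}

func main() {
	fmt.Println(swiglu([]float64{0, 2}, []float64{3, 1}))
}
```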
* mlx: add op wrappers for Conv2d, Pad, activations, trig, and masked SDPA
Add Conv2d, flexible Pad (with axes/mode), PadConstant, Maximum,
Minimum, Softplus, ReLU, GLU, Clamp, Sin, Cos, Clip,
ScaledDotProductAttentionMasked, and RoPEWithFreqs. Refactor
RoPEWithBase to delegate to RoPEWithFreqs.
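The delegation can be illustrated with a scalar sketch in plain Go. Everything here is an assumption made for illustration: the lowercase function names, the adjacent-pair rotation layout, and the convention that each frequency is multiplied by the position to get the rotation angle. MLX's own conventions for custom frequencies may differ.

```go
package main

import (
	"fmt"
	"math"
)

// ropeWithFreqs rotates adjacent pairs (x[2i], x[2i+1]) by pos*freqs[i].
// Hypothetical reference; the pairing and frequency convention are assumptions.
func ropeWithFreqs(x []float64, pos int, freqs []float64) []float64 {
	out := make([]float64, len(x))
	for i, f := range freqs {
		sin, cos := math.Sincos(float64(pos) * f)
		a, b := x[2*i], x[2*i+1]
		out[2*i] = a*cos - b*sin
		out[2*i+1] = a*sin + b*cos
	}
	return out
}

// ropeWithBase derives freqs[i] = base^(-2i/dim) and delegates, so the
// rotation logic lives in one place.
func ropeWithBase(x []float64, pos int, base float64) []float64 {
	dim := len(x)
	freqs := make([]float64, dim/2)
	for i := range freqs {
		freqs[i] = math.Pow(base, -float64(2*i)/float64(dim))
	}
	return ropeWithFreqs(x, pos, freqs)
}

func main() {
	fmt.Println(ropeWithBase([]float64{1, 0, 1, 0}, 1, 10000))
}
```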
* review comments
* mlx: fix ScaledDotProductAttentionMasked to consult the mask argument
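To show why the mask argument matters, here is a single-query reference for masked attention in plain Go. It is a sketch that assumes an additive mask (0 to keep a key, negative infinity to drop it); `maskedAttention` and `negInf` are hypothetical names, not part of the bindings.

```go
package main

import (
	"fmt"
	"math"
)

var negInf = math.Inf(-1)

// maskedAttention computes softmax(q.K^T * scale + mask) . V for one query.
// At least one key must be unmasked, otherwise the result is NaN.
func maskedAttention(q []float64, keys, values [][]float64, mask []float64) []float64 {
	scale := 1 / math.Sqrt(float64(len(q)))
	scores := make([]float64, len(keys))
	maxScore := negInf
	for j, k := range keys {
		dot := 0.0
		for d := range q {
			dot += q[d] * k[d]
		}
		scores[j] = dot*scale + mask[j] // the mask is applied before softmax
		if scores[j] > maxScore {
			maxScore = scores[j]
		}
	}

	// Numerically stable softmax; masked keys get exp(-Inf) = 0.
	sum := 0.0
	for j := range scores {
		scores[j] = math.Exp(scores[j] - maxScore)
		sum += scores[j]
	}

	out := make([]float64, len(values[0]))
	for j, v := range values {
		w := scores[j] / sum
		for d := range v {
			out[d] += w * v[d]
		}
	}
	return out
}

func main() {
	keys := [][]float64{{1, 0}, {0, 1}}
	values := [][]float64{{1, 2}, {3, 4}}
	fmt.Println(maskedAttention([]float64{1, 0}, keys, values, []float64{0, negInf}))
}
```

If the mask were ignored, the second value row would leak into the output; with the mask applied, the output equals the first value row.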
* prefer rocm v6 on windows
Avoid building with v7 - more changes are needed
* MLX: add header vendoring and remove go build tag
This switches to a vendoring approach for the mlx-c headers so that Go can
build without requiring a cmake run first. This enables building the new
MLX-based code by default. The headers are refreshed every time cmake runs,
so they stay in sync when we bump mlx versions. Basic Windows and Linux
support has been verified.
* ci: harden for flaky choco repo servers
CI sometimes fails because choco does not actually install the cache. Since
the cache only speeds up the build, we can proceed without it.
* review comments
This change adds a new MLX-based runner which includes:
* Method-based MLX bindings
* Subprocess-based MLX runner (x/mlxrunner)
* KV cache with tree management
* A basic sampler
The GLM4-MoE-Lite model has been ported to use the new bindings.
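One way a tree-managed KV cache can find reusable state is a prefix tree keyed by token IDs. The sketch below is an illustration under that assumption only; the `node` type and its methods are hypothetical and are not taken from the runner, and real nodes would also hold the cached key/value tensors.

```go
package main

import "fmt"

// node is one token step in a hypothetical prefix tree of cached sequences.
type node struct {
	children map[int]*node
}

func newNode() *node {
	return &node{children: map[int]*node{}}
}

// insert records a token sequence, creating nodes as needed.
func (n *node) insert(tokens []int) {
	cur := n
	for _, t := range tokens {
		next, ok := cur.children[t]
		if !ok {
			next = newNode()
			cur.children[t] = next
		}
		cur = next
	}
}

// longestPrefix returns how many leading tokens are already in the tree,
// that is, how much of a new prompt could reuse cached state.
func (n *node) longestPrefix(tokens []int) int {
	cur := n
	for i, t := range tokens {
		next, ok := cur.children[t]
		if !ok {
			return i
		}
		cur = next
	}
	return len(tokens)
}

func main() {
	root := newNode()
	root.insert([]int{1, 2, 3})
	fmt.Println(root.longestPrefix([]int{1, 2, 9}))
}
```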
---------
Co-authored-by: Michael Yang <git@mxy.ng>