After the rotating buffer has wrapped (c.offset > c.maxSize) a subsequent
L>1 Update() went through a slice-to-[0, c.idx) path that discarded all
slots in [c.idx, Dim), losing the older-but-still-in-window tokens the
first Q of the new batch needs for its sliding-window attention.
Linearize the circular buffer to logical order in that wrapped case so
the existing trim + concat preserves the last (maxSize - 1) old tokens.
When the buffer has not yet wrapped (c.offset <= c.maxSize), slots
[c.idx, Dim) are grow padding or stale post-rewind data, so keep
dropping them.
Converts SiLU/GELUApprox to compiled kernels and adds SwiGLU,
matching upstream mlx/mlx_lm's activations pattern. Routes llama,
qwen3, qwen3_5 (dense + MoE), and glm4_moe_lite MLP paths through
mlx.SwiGLU so each MLP invocation runs as one fused Metal/CUDA
kernel rather than a chain of per-op launches.
Wraps MLX's mlx_compile API so Go functions can be traced into fused
kernels. Contiguous elementwise chains collapse into a single
Metal/CUDA kernel instead of launching one per op.
Exposes Compile plus arity helpers (Compile1/2/3) that mirror Python's
@mx.compile decorator shape, lazily building the closure on first call
so package-level declarations work before the MLX dylib loads.
* gemma4: implement Gemma 4 model for MLX (text-only runtime)
* gemma4: two MoE + SWA prefill perf fixes
Two performance optimizations in the gemma4 forward pass
1. Memoize the sliding-window prefill mask across layers.
2. Softmax only over the selected experts in Router.Forward.
* review comments
* mlx: add op wrappers for Conv2d, Pad, activations, trig, and masked SDPA
Add Conv2d, flexible Pad (with axes/mode), PadConstant, Maximum,
Minimum, Softplus, ReLU, GLU, Clamp, Sin, Cos, Clip,
ScaledDotProductAttentionMasked, and RoPEWithFreqs. Refactor
RoPEWithBase to delegate to RoPEWithFreqs.
* review comments
* mlx: fix ScaledDotProductAttentionMasked to consult the mask argument
Improve the MLX model creation pipeline with several model-agnostic changes:
- Rewrite supportsVision to use vision_config instead of architecture name
- Add supportsAudio for audio encoder detection
- Add alignment checking (isAligned) for quantization group sizes
- Support per-projection mixed quantization in MoE expert packing
- Record per-tensor quant metadata in safetensors blobs
- Parse per-tensor quant metadata at model load time
- Validate quantize output is non-empty before storing
- Fix pin/unpin cleanup in expert group quantization
- Promote v_proj/k_proj/down_proj to INT8 for INT4 base quant
- Add MetalIsAvailable() utility
- Skip audio encoder tensors from quantization
* pull: refine safetensors pull
- Body drain in resolve() — drain response body before close so Go's HTTP
client can reuse TCP connections instead of opening a new one per blob
(1,075 extra TCP+TLS handshakes eliminated)
- Skip speed recording for tiny blobs (<100KB) — prevents
HTTP-overhead-dominated transfer times from poisoning the median, which the
stall detector uses to cancel "too slow" downloads
- Resume support for large blobs (>=64MB) — on failure, preserves partial .tmp
files; on retry, re-hashes existing datak and sends Range header to download
only remaining bytes; gracefully falls back to full download if server returns
200 instead of 206; SHA256 verification catches corrupt partials
* harden push
- Prevents killing TCP connections after every request.
- Stronger backoff to handle server back-pressure and rate limiting
- Larger buffered reads for improve safetensor upload performance
- Better error message handling from server
- Handle 201 if server says blob exists
- Fix progress reporting on already uploaded blobs
- Trace logging to help troubleshoot and tune going forward
* review comments
* review comments
* create: Clean up experimental paths
This cleans up the experimental features, and adds both unit and integration test coverage to verify no regressions.
* create: preserve config and layer names when creating from safetensors models
When creating a model FROM an existing safetensors model, ModelFormat,
Capabilities, and layer Name fields were lost. ModelFormat stayed empty
because it's only set from GGML layers (which safetensors models lack),
and layer names weren't copied in parseFromModel. This caused derived
models to fail loading ("config.json not found in manifest").
* review comments
* mlx: Improve M5 performance with NAX
This modifies the Mac release to now have 2 builds of MLX for broader
compatibility while supporting the latest M5 hardware features. NAX requires
building with xcode 26.2 and targetting support only for OS v26 and up. Since
we want to support older MacOS versions as well, we now need 2 different MLX
builds and runtime detection logic to select the optimal version. The newer
build will detect NAX missing at runtime, so it is safe to run on pre M5 macs.
* mac: prevent generate on cross-compiles
For some versions of Xcode, cmake builds are failing due to header problems in
cross-compiling during the generate phase. Since generate is producing arch
independent generated output, we can skip this during cross-compiling.
Replace hardcoded Encode(prompt, true) with
Encode(prompt, r.Tokenizer.AddBOS()) so the pipeline respects each
model's tokenizer configuration.
Models with add_bos_token=true (gemma3, llama): unchanged, tokenizer
still prepends BOS.
Models with bos_token=null (qwen3, qwen3.5): unchanged, the BOS
guard (vocab.BOS >= 0) already prevented prepending regardless of
the flag.
This aligns the pipeline with the /v1/tokenize endpoint which already
uses Tokenizer.AddBOS().
Add periodic snapshots every 8k tokens and near the end of the prompt
so that long prompts can be partially restored and thinking/generation
can be retried without full reprocessing.
Update LRU last used time just on the nodes that actually used
during processing rather than all snapshots along the path. This
allows eviction to remove nodes more accurately so we can avoid
other heuristics to auto-merge nodes.
mlx.Copy shares the backing buffer with its source (via
copy_shared_buffer) rather than allocating independent storage.
When used to snapshot a slice of the KV cache, the snapshot array
holds the entire original cache buffer alive through the shared
data pointer — even after eval detaches the computation graph.
Replace Copy with Contiguous in Snapshot and Split. Contiguous
allocates a compact buffer when the source buffer is significantly
larger than the logical slice (Contiguous::eval checks
buffer_size > nbytes + 16384), which is always the case for KV
cache slices.
Previously, a partial match within a node's edge would truncate the path
to the parent snapshot - effectively making all cache types behave as
recurrent caches. Caches with only transformer layers can rewind to
arbitrary boundary so this restores this capability to improve cache
hits
* mlx: update to HEAD on 3/23
Also fixes a few misc vendoring bugs uncovered with this first update.
This also renames the version files to make them clearer.
* CUDA Fast Gated Delta kernel
* mlx: detect eval errors and panic
On model errors or missing kernels, don't mask the error, bubble it up.
Receiving from a buffered chan error consumes the value, so only the
first caller (WaitUntilRunning, HasExited, or Close) sees the signal.
Subsequent receivers block or take the wrong branch. Replace with a
closed chan struct{} which can be received from any number of times,
and store the error in a separate field.
The stderr reader used bufio.Scanner which has a 64KB max line size.
If the subprocess wrote a line exceeding this limit, the scanner would
stop reading, the OS pipe buffer would fill, and the subprocess would
deadlock.
Replace the scanner with a statusWriter that wraps io.Copy. The writer
forwards all stderr to os.Stderr while capturing the last short line
(≤256 bytes) for error reporting, avoiding both the deadlock and the
need to buffer arbitrarily long lines.
Enable multiple conversations to reuse cached computations when they
share token prefixes (e.g. the same system prompt). A prefix trie
tracks shared regions so switching between conversations only
recomputes tokens that diverge. Inactive conversation state is paged
from active GPU memory to other memory and restored on demand, with LRU
eviction to keep memory usage bounded.
Slice used cmp.Or to resolve a zero stop value to the dimension size,
intended to support open-ended slices like a[i:]. This made Slice(0, 0)
indistinguishable from Slice(), so any slice with a zero stop would
silently include the entire dimension instead of being empty.
Replace cmp.Or with an explicit End sentinel and resolve negative
indices against the dimension size, matching Python/PyTorch semantics.
Add QuantizedEmbedding and EmbeddingLayer interface so models can
use quantized embedding weights and expose tied output projections.
This change updates gemma3, glm4_moe_lite, llama, qwen3, and qwen3_5
to use the new interface.
This change adds a tensorImportTransform interface for model-specific
tensor transformations during safetensors import. This allows importing
and modifying the standard HF based weights as well as the mlx-community
derived pre-quantized safetensors repos to be directly
imported into `ollama create`. Right now this only works with Qwen3.5
importing which does tensor renaming, norm weight shifting (it
adds +1 to each value of the norm vectors), conv1d transposition,
and casts to BF16s for F32 based vectors.
MLX runners (image generation and LLM) previously bypassed the
scheduler's standard load path via a separate loadMLX method. This meant
they skipped VRAM fitting checks and couldn't participate in model
eviction.
Now all model types flow through the same load function. Model eviction
for MLX is based on weights as KV cache and compute graph are dynamic.
This means that eviction does not take into account the worst case
memory and models can still compete for memory but it is a significant
improvement.
The CLI now links to the lazy-load MLX code, but that still happens in
init functions. On internal MLX errors, the CLI exits before it has a
chance to start. This change re-wires the MLX error handling so it
doesn't exit by default. The MLX based runners currently expect exits
on failure, so they re-initialize the default error handling. We can
refine error handling for better go stack traces in the future.
* prefer rocm v6 on windows
Avoid building with v7 - more changes are needed
* MLX: add header vendoring and remove go build tag
This switches to using a vendoring approach for the mlx-c headers so that Go
can build without requiring a cmake first. This enables building the new MLX
based code by default. Every time cmake runs, the headers are refreshed, so we
can easily keep them in sync when we bump mlx versions. Basic Windows
and Linux support are verified.
* ci: harden for flaky choco repo servers
CI sometimes fails due to choco not actually installing cache. Since it just speeds up the build, we can proceed without.
* review comments
- Collapse MLX sampling state into a single sample.Sampler struct (options + history).
- Replace interface-based sampler chain (TopP, TopK, penalty, etc.) with function-based transforms.
- Update request/pipeline wiring to use *sample.Sampler, seed history from prompt tokens, and append generated tokens each step.
- Implement top_p, min_p, repeat_penalty, and frequency_penalty
GLM models sometimes omits </arg_value> closing tags in tool call XML, causing xml.Unmarshal to fail with "element <arg_value> closed by </tool_call>".
This is a known issue across the GLM family.
Sanitize the input to fix closing arg_key values so encoding/xml can handle it.
This change adds support for qwen3.5-next-moe models (qwen3-next/qwen3.5-next/qwen3-coder) to the MLX runner. It also:
* introduces recurrent cache support and related MLX ops
* updates pipeline/runner integration and adds tests
* properly quantizes stacked expert tensors
* a Gated Delta Metal kernel for fast SSM inference
* adds new MLX calls for Conv1d, DepthwideConv1d, Contiguous, Exp, Log, SoftmaxAxis