* pull: refine safetensors pull
- Body drain in resolve() — drain response body before close so Go's HTTP
client can reuse TCP connections instead of opening a new one per blob
(1,075 extra TCP+TLS handshakes eliminated)
- Skip speed recording for tiny blobs (<100KB) — prevents
HTTP-overhead-dominated transfer times from poisoning the median, which the
stall detector uses to cancel "too slow" downloads
- Resume support for large blobs (>=64MB) — on failure, preserves partial .tmp
files; on retry, re-hashes existing data and sends a Range header to download
only remaining bytes; gracefully falls back to full download if server returns
200 instead of 206; SHA256 verification catches corrupt partials (see the
sketch below)
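Roughly, the resume path looks like the sketch below; `resumeBlob` and
`drainClose` are illustrative names, not the repo's actual identifiers. The
deferred drain is also what keeps connections reusable on the non-resume path.

```go
package pull

import (
	"context"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"net/http"
	"os"
)

// drainClose reads any unconsumed body bytes before closing so Go's
// Transport can return the TCP connection to its idle pool for reuse.
func drainClose(body io.ReadCloser) {
	io.Copy(io.Discard, body)
	body.Close()
}

// resumeBlob re-hashes an existing partial .tmp file, then asks the server
// for only the remaining bytes. A 200 instead of 206 means Range was
// ignored, so we restart from scratch; the final digest check catches
// corrupt partials either way.
func resumeBlob(ctx context.Context, url, tmpPath, wantDigest string) error {
	f, err := os.OpenFile(tmpPath, os.O_RDWR|os.O_CREATE, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()

	h := sha256.New()
	offset, err := io.Copy(h, f) // re-hash whatever survived the last attempt
	if err != nil {
		return err
	}

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-", offset))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer drainClose(resp.Body)

	switch resp.StatusCode {
	case http.StatusPartialContent: // 206: append the remaining bytes
	case http.StatusOK: // 200: server ignored Range, fall back to full download
		if err := f.Truncate(0); err != nil {
			return err
		}
		if _, err := f.Seek(0, io.SeekStart); err != nil {
			return err
		}
		h = sha256.New()
	default:
		return fmt.Errorf("unexpected status %s", resp.Status)
	}

	if _, err := io.Copy(io.MultiWriter(f, h), resp.Body); err != nil {
		return err // the partial .tmp is preserved for the next retry
	}
	if got := hex.EncodeToString(h.Sum(nil)); got != wantDigest {
		return fmt.Errorf("digest mismatch: got %s, want %s", got, wantDigest)
	}
	return nil
}
```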
* harden push
- Stop killing TCP connections after every request
- Stronger backoff to handle server back-pressure and rate limiting (see the
sketch after this list)
- Larger buffered reads to improve safetensors upload performance
- Better error message handling from server
- Handle 201 if server says blob exists
- Fix progress reporting on already uploaded blobs
- Trace logging to help troubleshoot and tune going forward
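The backoff is, in spirit, exponential with a cap and jitter; a minimal
sketch with assumed constants, not the tuned values:

```go
package push

import (
	"math/rand/v2"
	"time"
)

// backoff returns the wait before retry attempt n (0-based): exponential
// growth from a 500ms base, capped at 30s, with half the window jittered
// so clients being rate limited don't all retry in lockstep.
func backoff(attempt int) time.Duration {
	if attempt > 6 {
		attempt = 6 // 500ms << 6 = 32s, already past the cap
	}
	d := 500 * time.Millisecond << attempt
	if d > 30*time.Second {
		d = 30 * time.Second
	}
	return d/2 + rand.N(d/2) // half fixed, half jitter
}
```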
* review comments
* create: Clean up experimental paths
This cleans up the experimental features and adds both unit and integration test coverage to verify no regressions.
* create: preserve config and layer names when creating from safetensors models
When creating a model FROM an existing safetensors model, ModelFormat,
Capabilities, and layer Name fields were lost. ModelFormat stayed empty
because it's only set from GGML layers (which safetensors models lack),
and layer names weren't copied in parseFromModel. This caused derived
models to fail loading ("config.json not found in manifest").
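A schematic sketch of the fix; the types and the function are stand-ins for
the real manifest structs and parseFromModel:

```go
package create

// Schematic stand-ins for the real manifest types.
type Layer struct {
	MediaType, Digest, Name string
	Size                    int64
}

type Config struct {
	ModelFormat  string
	Capabilities []string
}

// preserveFromSource mirrors the fix: ModelFormat and Capabilities come from
// the source config (nothing else sets them for safetensors models), and
// layer Names are kept so lookups like "config.json" still resolve.
func preserveFromSource(dst *Config, dstLayers *[]Layer, src Config, srcLayers []Layer) {
	dst.ModelFormat = src.ModelFormat
	dst.Capabilities = src.Capabilities
	for _, l := range srcLayers {
		*dstLayers = append(*dstLayers, Layer{
			MediaType: l.MediaType,
			Digest:    l.Digest,
			Size:      l.Size,
			Name:      l.Name, // previously dropped, breaking config.json lookup
		})
	}
}
```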
* review comments
* mlx: Improve M5 performance with NAX
This modifies the Mac release to ship two builds of MLX for broader
compatibility while supporting the latest M5 hardware features. NAX requires
building with Xcode 26.2 and targeting only OS v26 and up. Since we want to
support older macOS versions as well, we now need two different MLX builds
and runtime detection logic to select the optimal version. The newer build
detects at runtime that NAX is missing, so it is safe to run on pre-M5 Macs.
* mac: prevent generate on cross-compiles
For some versions of Xcode, cmake builds fail due to header problems when
cross-compiling during the generate phase. Since generate produces
arch-independent output, we can skip it when cross-compiling.
* mlx: update to HEAD on 3/23
Also fixes a few misc vendoring bugs uncovered with this first update.
This also renames the version files to make them clearer.
* CUDA Fast Gated Delta kernel
* mlx: detect eval errors and panic
On model errors or missing kernels, don't mask the error; bubble it up.
This change adds a tensorImportTransform interface for model-specific
tensor transformations during safetensors import. This allows both the
standard HF-based weights and the mlx-community derived pre-quantized
safetensors repos to be imported directly into `ollama create`. Right now
this only works for Qwen3.5 imports, which perform tensor renaming, norm
weight shifting (adding +1 to each value of the norm vectors), conv1d
transposition, and casting of F32-based vectors to BF16.
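The interface is roughly of this shape; a hypothetical sketch, since the
method set here is assumed rather than copied from the change:

```go
package safetensors

// tensorImportTransform is a hypothetical rendering of the interface: one
// implementation per architecture (currently Qwen3.5) hooks into the
// safetensors import path.
type tensorImportTransform interface {
	// rename maps an upstream (HF or mlx-community) tensor name to the
	// internal name used in the created model.
	rename(name string) string
	// transform rewrites tensor data during import: shifting norm weights
	// by +1, transposing conv1d kernels, casting F32 vectors to BF16, etc.
	// It returns the new bytes and possibly a new dtype.
	transform(name, dtype string, data []byte) ([]byte, string, error)
}
```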
MLX runners (image generation and LLM) previously bypassed the
scheduler's standard load path via a separate loadMLX method. This meant
they skipped VRAM fitting checks and couldn't participate in model
eviction.
Now all model types flow through the same load function. Eviction for MLX
models is based on weight size alone, since the KV cache and compute graph
are dynamic. This means eviction does not account for worst-case memory use
and models can still compete for memory, but it is a significant
improvement.
The CLI now links to the lazy-load MLX code, but that still happens in
init functions. On internal MLX errors, the CLI exits before it has a
chance to start. This change re-wires the MLX error handling so it
doesn't exit by default. The MLX-based runners currently expect exits
on failure, so they re-initialize the default error handling. We can
refine error handling for better Go stack traces in the future.
* prefer rocm v6 on windows
Avoid building with v7 - more changes are needed
* MLX: add header vendoring and remove go build tag
This switches to a vendoring approach for the mlx-c headers so that Go can
build without requiring a cmake run first. This enables building the new
MLX-based code by default. Every time cmake runs, the headers are refreshed,
so we can easily keep them in sync when we bump mlx versions. Basic Windows
and Linux support are verified.
* ci: harden for flaky choco repo servers
CI sometimes fails because choco does not actually install the cache package. Since it only speeds up the build, we can proceed without it.
* review comments
The MLX runner previously reported a static VRAM estimate that was
computed at load time and consisted only of the weights. This is
strictly less than the actual memory usage, as it does not include
the KV cache or compute graph.
This change adds a new MLX-based runner which includes:
* Method-based MLX bindings
* Subprocess-based MLX runner (x/mlxrunner)
* KV cache with tree management
* A basic sampler
The GLM4-MoE-Lite model has been ported to use the new bindings.
---------
Co-authored-by: Michael Yang <git@mxy.ng>
This change includes:
- changes to the safetensors metadata format
- changes to the create command to properly create the blobs with the new format
- changes to load the new format
- fixes `ollama show` to properly display each tensor
When context length is clamped to the model's trained context length,
ollama ps now shows the actual clamped value instead of the originally
configured value.
- Fix panic in ollama show for image gen models (safe type assertion)
- Add vision capability for Flux2KleinPipeline models at create time
- Flatten transparent PNG images onto white background for better results
Remove static VRAM estimation (EstimateVRAM, CheckMemoryRequirements)
which wasn't helpful. Instead, report the actual tensor weight size
from the manifest for ollama ps.
- Remove memory estimation check from runner startup
- Remove EstimateVRAM, CheckMemoryRequirements, modelVRAMEstimates
- Add TotalTensorSize() to get actual weight size from manifest
- Use weight size for Server.vramSize instead of estimates
Note: This is better than showing 0 or inaccurate estimates, but the
weight size is a drastic underestimation of actual memory usage since
it doesn't account for activations, intermediate tensors, or MLX
overhead. Future work should query real-time memory from MLX
(e.g., MetalGetActiveMemory) for accurate reporting.
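A sketch of the idea; the layer fields and the media-type check are
assumptions, not the real method:

```go
package manifest

import "strings"

// Schematic stand-ins for the real manifest types.
type Layer struct {
	MediaType string
	Size      int64
}

type Manifest struct {
	Layers []Layer
}

// TotalTensorSize sums the stored size of tensor layers. This is exact for
// the weights but, as noted above, underestimates runtime memory since the
// KV cache, activations, and MLX overhead are excluded.
func (m *Manifest) TotalTensorSize() int64 {
	var total int64
	for _, l := range m.Layers {
		if strings.Contains(l.MediaType, "tensor") { // media-type match is an assumption
			total += l.Size
		}
	}
	return total
}
```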
Remove the Qwen image generation and image editing model packages
to clean up the codebase. These models will be reintroduced later.
- Delete x/imagegen/models/qwen_image/ (10 files)
- Delete x/imagegen/models/qwen_image_edit/ (5 files)
- Remove related CLI flags and imports from cmd/engine/main.go
- Update comments in cache/step.go to remove Qwen-specific references
Add --quantize fp4 support to ollama create for image generation models
(flux2, z-image-turbo), using MLX's affine 4-bit quantization.
Changes:
- Add fp4 to validation in CreateImageGenModel
- Add FP4 case to quantizeTensor (group_size=32, bits=4, affine mode)
- Add GetQuantization() to WeightSource interface for dynamic params
- Update LoadLinearLayer to use quantization params from model metadata
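A hedged sketch of the FP4 case above; `mlx.Array` and `mlx.Quantize` are
stand-ins for whatever the Go bindings expose over MLX's affine quantization
(upstream `mlx.core.quantize` takes a group size and bit width), not the
repo's actual API:

```go
// quantizeFP4 mirrors the described FP4 path: affine quantization with
// group_size=32 and bits=4, so each 32-element group stores 4-bit weights
// plus a per-group scale and bias (w ~= wq*scales + biases).
func quantizeFP4(w *mlx.Array) (wq, scales, biases *mlx.Array) {
	return mlx.Quantize(w, 32, 4) // group_size=32, bits=4, affine mode
}
```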
* MLX - dynamic loading of mlx-c
Create a wrapper layer to indirect the dependency on mlx-c so the main
ollama binary does not have a load-time dependency on mlx-c, mlx, and, on
Linux, CUDA. Lazy load the library via dlopen so we can adjust the path to
ensure the dependencies are found, and fail gracefully if they are not
present.
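The lazy load follows the standard dlopen/dlsym pattern; a cgo sketch with
illustrative names:

```go
package mlx

/*
#cgo LDFLAGS: -ldl
#include <dlfcn.h>
#include <stdlib.h>
*/
import "C"

import (
	"fmt"
	"unsafe"
)

// load dlopens the mlx-c wrapper at an adjusted search path so the main
// ollama binary carries no load-time dependency on mlx-c, mlx, or CUDA;
// if the library is missing we return an error instead of dying at exec.
func load(path string) (unsafe.Pointer, error) {
	cpath := C.CString(path)
	defer C.free(unsafe.Pointer(cpath))
	h := C.dlopen(cpath, C.RTLD_NOW|C.RTLD_LOCAL)
	if h == nil {
		return nil, fmt.Errorf("mlx unavailable: %s", C.GoString(C.dlerror()))
	}
	return h, nil // entry points are then resolved with dlsym
}
```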
* review comments
* fix broken tests
* x: make `ollama create --experimental` import from safetensors
This change allows pulling safetensors models into the new experimental model format, and also
fixes the `ollama show` command to correctly display the model information.
* gofumpt the linter
* gofumpt the linter again
* validate the model name
- Install mlx.metallib for arm64 builds (required for Metal GPU acceleration)
- Apply rpath settings to all macOS builds, not just x86_64
- Add CMAKE_BUILD_WITH_INSTALL_RPATH to avoid install_name_tool errors
- Update build_darwin.sh to copy, sign, and package the metallib
TeaCache:
- Timestep embedding similarity caching for diffusion models
- Polynomial rescaling with configurable thresholds
- Reduces transformer forward passes by ~30-50%
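Conceptually, the caching gate works like the sketch below; the names are
assumed, and the thresholds and polynomial coefficients are per-model:

```go
package teacache

// cache decides whether a diffusion step can skip the transformer pass: it
// accumulates a polynomially rescaled distance between consecutive timestep
// embeddings and reuses the cached residual until a threshold is crossed.
type cache struct {
	accum     float32
	threshold float32
	poly      []float32 // rescaling polynomial coefficients, lowest order first
	residual  []float32 // output delta saved from the last full forward pass
}

// shouldSkip takes the relative distance between this step's timestep
// embedding and the previous one.
func (c *cache) shouldSkip(relDist float32) bool {
	scaled, x := float32(0), float32(1)
	for _, coef := range c.poly {
		scaled += coef * x // evaluate the rescaling polynomial at relDist
		x *= relDist
	}
	c.accum += scaled
	if c.accum < c.threshold {
		return true // similar enough: reuse c.residual, skip the pass
	}
	c.accum = 0 // run the full pass and refresh c.residual
	return false
}
```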
FP8 quantization:
- Support for FP8 quantized models (8-bit weights with scales)
- QuantizedMatmul on Metal, Dequantize on CUDA
- Client-side quantization via ollama create --quantize fp8
Other bug fixes:
- Fix `/api/show` API for image generation models
- Server properly returns model info (architecture, parameters, quantization)
- Memory allocation optimizations
- CLI improvements for image generation
Removes 5-minute HTTP client timeout that caused "context deadline
exceeded" errors on large file downloads. Stall detection (10s)
already handles unresponsive connections.
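The fix reduces to not setting a whole-request deadline; schematically:

```go
package pull

import "net/http"

// Before: &http.Client{Timeout: 5 * time.Minute} put a deadline on the whole
// request, killing healthy large downloads with "context deadline exceeded".
// After: no whole-request deadline; the 10s stall detector cancels the
// request's context when no bytes arrive, which covers dead connections.
var client = &http.Client{}
```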
Fixes progress bar total going down on resume by calculating total
from all blobs upfront and reporting already-downloaded bytes
as completed immediately.
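Schematically (the blob and progress types here are stand-ins for the
actual ones):

```go
package pull

// Schematic stand-ins for the real blob and progress types.
type Blob struct{ Size, OnDisk int64 }

type Bar interface {
	SetTotal(n int64)
	Add(n int64)
}

// initProgress fixes the total up front across all blobs and credits bytes
// already on disk, so resuming never makes the bar's total go down.
func initProgress(bar Bar, blobs []Blob) {
	var total, done int64
	for _, b := range blobs {
		total += b.Size
		done += b.OnDisk
	}
	bar.SetTotal(total)
	bar.Add(done)
}
```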
* WIP - MLX backend with gemma3
* MLX: add cmake and go tag build toggles
To build the new MLX backend code:
cmake --preset MLX
cmake --build --preset MLX --parallel
cmake --install build --component MLX
go build -tags mlx .
Note: the main.go entrypoint for the MLX engine will change in a follow-up commit.
* add experimental image generation runtime
* MLX: wire up cuda build for linux
* MLX: get dependencies correct and dedup
This is still too large for a unified GitHub artifact, but is now "correct" for the mlx_cuda_v13
directory.
* fix relative link bug in dedup
* Add darwin build and readme
* add go build tag for mlx dependent code and wire up build_darwin.sh
* lint cleanup
* macos: build mlx for x86
This will be CPU only.
* cuda build instructions and fix drift from mlx bump
* stale comment
* Delete agent helper doc
* Clean up readme.md
* Revise README for tokenizer clarity and details
Updated the README to clarify tokenizer functionality and removed the correctness section.
---------
Co-authored-by: jmorganca <jmorganca@gmail.com>