Two reductions:
1. Drop the gguf_rename_tensor forwarder from gguf.h/gguf.cpp.
The rename-in-place trick it does (calling ggml_set_name on an embedded
ggml_tensor) can be done from outside gguf.cpp via:
       char * p = const_cast<char *>(gguf_get_tensor_name(meta, id));
       strncpy(p, new_name, GGML_MAX_NAME - 1);
That pointer aims into a mutable char[GGML_MAX_NAME] inside a std::vector
element; the const on the return type is API courtesy. This is defined
behavior and has no struct-layout dependency.
2. Drop the src/CMakeLists.txt hunk that added llama-ollama-compat.cpp to
the llama target. Replace with a target_sources() call in Ollama's
llama/server/CMakeLists.txt after FetchContent_MakeAvailable. Our
compat files now stay in llama/compat/ and are never copied into the
fetched _deps/ tree.
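
The in-place rename from item 1 can be sketched without ggml as follows. Everything here is a stand-in: meta_ctx, tensor_info, get_tensor_name and MAX_NAME are hypothetical substitutes for gguf_context, the embedded ggml_tensor, gguf_get_tensor_name and GGML_MAX_NAME. The sketch also writes the terminator explicitly, since strncpy does not add one when the source fills the whole copied length.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for GGML_MAX_NAME; the real value may differ.
constexpr std::size_t MAX_NAME = 64;

// Hypothetical stand-in for the tensor record: the name lives in a
// mutable fixed-size buffer.
struct tensor_info {
    char name[MAX_NAME];
};

// Hypothetical stand-in for the metadata context that owns the records.
struct meta_ctx {
    std::vector<tensor_info> infos;
};

// Accessor shaped like a const-returning getter: the pointee is mutable
// storage, only the returned pointer type is const.
const char * get_tensor_name(const meta_ctx & ctx, std::size_t id) {
    return ctx.infos[id].name;
}

// Rename in place from outside the owning module. Casting away const is
// well-defined here because the underlying char array is not a const object.
void rename_tensor(meta_ctx & ctx, std::size_t id, const char * new_name) {
    char * p = const_cast<char *>(get_tensor_name(ctx, id));
    std::strncpy(p, new_name, MAX_NAME - 1);
    p[MAX_NAME - 1] = '\0'; // strncpy does not terminate over-long input
}
```

Names longer than MAX_NAME - 1 characters are truncated rather than overflowing the buffer.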
Net patch now touches 3 files, 20 lines, all pure call-site insertions:
    src/llama-model-loader.cpp   +8  (include + translate + 2x should_skip)
    src/llama-model.cpp          +4  (include + apply_tensor_transforms)
    tools/mtmd/clip.cpp          +8  (include + translate_clip + maybe_load)
Verified: fresh build from scratch (rm -rf build && cmake configure)
runs PATCH_COMMAND cleanly, compiles, and ollama run gemma3 still works
end-to-end for text + vision.
llama.cpp compatibility shim
This directory holds an in-process compatibility layer that lets upstream
llama-server load GGUFs produced by older versions of Ollama (and files
pulled from the Ollama registry) without re-converting or re-downloading.
The layer is applied automatically at build time via CMake FetchContent's
PATCH_COMMAND — there is no separate "apply patches" step.
Files
- llama-ollama-compat.h, llama-ollama-compat.cpp: the shim itself. These are regular source files owned by Ollama; they get copied into the fetched llama.cpp source tree during configure.
- upstream-edits.patch: small additive edits to upstream files so the shim gets called. Currently ~48 lines touching 6 files. Kept as a real git patch so regeneration on upstream bumps is one command.
What the shim does
The shim runs at two well-defined points in the loader:
- After gguf_init_from_file, for both the main model loader and the mtmd/clip loader: inspects the just-parsed metadata and decides whether the file is an Ollama-format GGUF. If so, it mutates the in-memory gguf_context and ggml_context (KV names, tensor names, tensor types) so the rest of the loader sees an upstream-shape file.
- After load_all_data: applies any numerical fix-ups that need the tensors in their final backend buffers (e.g. RMSNorm +1 if a future arch needs it; gemma3 doesn't).
A file is treated as non-Ollama when it has neither Ollama-specific KV keys
(e.g. gemma3.mm.tokens_per_image) nor embedded v.* / mm.* tensors in
the main model file. When no markers are present, every compat function is
an immediate no-op.
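
The marker check described above might look roughly like this. It is a minimal sketch, not the shim's actual code: looks_like_ollama_gguf and its vector-of-strings inputs are hypothetical, and the only markers it knows are the ones named in this README.

```cpp
#include <string>
#include <vector>

// Returns true if `s` begins with `prefix`.
static bool starts_with(const std::string & s, const std::string & prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
}

// Hypothetical detector: takes the KV keys and tensor names already read
// from the file and reports whether any Ollama marker is present.
bool looks_like_ollama_gguf(const std::vector<std::string> & kv_keys,
                            const std::vector<std::string> & tensor_names) {
    // Marker 1: an Ollama-specific KV key.
    for (const auto & key : kv_keys) {
        if (key == "gemma3.mm.tokens_per_image") {
            return true;
        }
    }
    // Marker 2: vision or projector tensors embedded in the main model file.
    for (const auto & name : tensor_names) {
        if (starts_with(name, "v.") || starts_with(name, "mm.")) {
            return true;
        }
    }
    return false; // no markers: callers would return without touching anything
}
```

A caller would bail out early when this returns false, which is what keeps the shim inert for upstream-format files.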
Currently supported architectures
| Arch | Text loader | Clip (mmproj) loader |
|---|---|---|
| gemma3 | KV injection (layer_norm_rms_epsilon, rope.freq_base, rope.freq_base_swa), tokenizer vocab truncation, drop v.*/mm.* tensors | Arch rewrite to clip, KV synthesis (clip.vision.*, clip.projector_type=gemma3), tensor renames (v.patch_embedding → v.patch_embd, mlp.fc{1,2} → ffn_{down,up}, etc.), F16 → F32 promotion for patch/position embeddings (Metal IM2COL requirement) |
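
The tensor renames in the clip column amount to substring rewrites on tensor names. The sketch below covers only the three renames spelled out in the table (the "etc." is not filled in). rename_clip_tensor is a hypothetical helper, and the full tensor names in the comments, such as v.blk.0.mlp.fc1.weight, are assumed for illustration.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Replace the first occurrence of `from` in `name` with `to`.
// Returns true if a replacement was made.
static bool replace_first(std::string & name, const std::string & from, const std::string & to) {
    const std::size_t pos = name.find(from);
    if (pos == std::string::npos) {
        return false;
    }
    name.replace(pos, from.size(), to);
    return true;
}

// Hypothetical helper: maps an Ollama-style clip tensor name to the
// upstream-style name, using only the renames listed in the table.
std::string rename_clip_tensor(std::string name) {
    static const std::vector<std::pair<std::string, std::string>> rules = {
        {"v.patch_embedding", "v.patch_embd"},
        {"mlp.fc1",           "ffn_down"},
        {"mlp.fc2",           "ffn_up"},
    };
    for (const auto & rule : rules) {
        replace_first(name, rule.first, rule.second);
    }
    return name; // names matching no rule come back unchanged
}
```

Under these rules, an assumed name like v.blk.0.mlp.fc1.weight becomes v.blk.0.ffn_down.weight, and names that match no rule pass through untouched.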
Usage:
    llama-server --model /path/to/ollama-blob --mmproj /path/to/ollama-blob
Passing the same monolithic GGUF as both --model and --mmproj works —
each loader applies its own translation.
Additional architectures are added by implementing a handle_<arch>()
and (for vision models) handle_<arch>_clip() in llama-ollama-compat.cpp
and dispatching them from translate_metadata / translate_clip_metadata.
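
One way to picture the per-architecture dispatch is a lookup table from arch name to handler. This is a sketch under assumptions: the metadata struct, the handler signature, the translated_by key and the function translate_metadata_sketch are all hypothetical, and the real translate_metadata may be structured differently (for example as an if/else chain).

```cpp
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the parsed metadata the real handlers mutate.
struct metadata {
    std::string arch;
    std::unordered_map<std::string, std::string> kv;
};

using arch_handler = std::function<void(metadata &)>;

// Hypothetical handler body: the real handle_gemma3 would inject KVs and
// rewrite tensors; this one only records that it ran.
void handle_gemma3(metadata & meta) {
    meta.kv["translated_by"] = "handle_gemma3";
}

// Looks up the handler for meta.arch and runs it.
// Returns false, leaving `meta` untouched, for architectures with no handler.
bool translate_metadata_sketch(metadata & meta) {
    static const std::unordered_map<std::string, arch_handler> handlers = {
        {"gemma3", handle_gemma3},
    };
    const auto it = handlers.find(meta.arch);
    if (it == handlers.end()) {
        return false;
    }
    it->second(meta);
    return true;
}
```

With this shape, supporting a new architecture means writing one handler and adding one table entry; a clip-side table would follow the same pattern.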
Regenerating upstream-edits.patch
When upstream changes the insertion points (rare), re-apply the edits to a fresh checkout and run:
    cd /path/to/llama.cpp
    git diff -- \
        ggml/include/gguf.h \
        ggml/src/gguf.cpp \
        src/CMakeLists.txt \
        src/llama-model-loader.cpp \
        src/llama-model.cpp \
        tools/mtmd/clip.cpp \
        > /path/to/ollama/llama/compat/upstream-edits.patch
Why not fork llama.cpp or vendor it?
Forking means tracking upstream manually. Vendoring means snapshotting all of
llama.cpp's source in the Ollama tree (the old llama/llama.cpp/ layout).
This shim keeps upstream unmodified on disk and the Ollama-specific logic
isolated in two files plus a small diff — upstream bumps are usually just
LLAMA_CPP_VERSION changes.