Two reductions:
1. Drop the gguf_rename_tensor forwarder from gguf.h/gguf.cpp.
The rename-in-place trick it does (calling ggml_set_name on an embedded
ggml_tensor) can be done from outside gguf.cpp via:
    char * p = const_cast<char *>(gguf_get_tensor_name(meta, id));
    strncpy(p, new_name, GGML_MAX_NAME - 1);
    p[GGML_MAX_NAME - 1] = '\0'; // strncpy does not null-terminate on truncation
That pointer points into a mutable char[GGML_MAX_NAME] inside a std::vector
element; the const on the return type is API courtesy only. Writing through
it is defined behavior and has no struct-layout dependency.
2. Drop the src/CMakeLists.txt hunk that added llama-ollama-compat.cpp to
the llama target. Replace with a target_sources() call in Ollama's
llama/server/CMakeLists.txt after FetchContent_MakeAvailable. Our
compat files now stay in llama/compat/ and are never copied into the
fetched _deps/ tree.
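A minimal sketch of that target_sources() hookup, assuming the fetched
content is named llama_cpp and the library target is llama (both names are
assumptions, not the actual Ollama tree):

```cmake
# Hypothetical fragment for llama/server/CMakeLists.txt.
# Target and path names are illustrative.
FetchContent_MakeAvailable(llama_cpp)

# Attach the compat shim to the fetched llama target; the source file
# stays in llama/compat/ and is never copied into the _deps/ tree.
target_sources(llama PRIVATE
    ${CMAKE_CURRENT_SOURCE_DIR}/../compat/llama-ollama-compat.cpp)
target_include_directories(llama PRIVATE
    ${CMAKE_CURRENT_SOURCE_DIR}/../compat)
```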
Net patch now touches 3 files, 20 lines, all pure call-site insertions:
src/llama-model-loader.cpp +8 (include + translate + 2x should_skip)
src/llama-model.cpp +4 (include + apply_tensor_transforms)
tools/mtmd/clip.cpp +8 (include + translate_clip + maybe_load)
Verified: fresh build from scratch (rm -rf build && cmake configure)
runs PATCH_COMMAND cleanly, compiles, and ollama run gemma3 still works
end-to-end for text + vision.
Older Ollama builds ship GGUFs that diverge slightly from upstream llama.cpp
in arch names, KV keys, tensor names, and (for vision models) file layout
(text+vision in one monolithic file). This adds a self-contained compat
layer that translates those files in memory at load time, so
~/.ollama/models/blobs/* can be served by upstream llama-server with no
re-conversion and no re-download.
Structure:
    llama/compat/
        llama-ollama-compat.{h,cpp} — the shim (Ollama-owned, ~500 LOC)
        upstream-edits.patch — ~48 lines of call-site hooks in 6 upstream files
        compat.cmake — include()-able CMake fragment
        README.md — what/why/how-to-regen
Integration: llama/server/CMakeLists.txt includes compat.cmake and passes
OLLAMA_LLAMA_CPP_COMPAT_PATCH_COMMAND to FetchContent_Declare via
PATCH_COMMAND. When OLLAMA_LLAMA_CPP_SOURCE is set (dev mode), the patch is
skipped so the developer's tree stays untouched.
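A rough sketch of that wiring, assuming compat.cmake defines
OLLAMA_LLAMA_CPP_COMPAT_PATCH_COMMAND (repository URL, content name, and the
variable-expansion trick are all illustrative):

```cmake
# Hypothetical sketch of llama/server/CMakeLists.txt integration.
include(${CMAKE_CURRENT_SOURCE_DIR}/../compat/compat.cmake)

if(NOT OLLAMA_LLAMA_CPP_SOURCE)
    # Apply the compat hooks only to the fetched tree, never to a
    # developer-provided checkout (dev mode).
    set(_patch_cmd PATCH_COMMAND ${OLLAMA_LLAMA_CPP_COMPAT_PATCH_COMMAND})
endif()

FetchContent_Declare(llama_cpp
    GIT_REPOSITORY https://github.com/ggml-org/llama.cpp
    GIT_TAG        master   # a pinned tag in the real build
    ${_patch_cmd})
FetchContent_MakeAvailable(llama_cpp)
```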
Currently handles gemma3 (text + vision). Pattern is data-driven — adding
other archs is a new handle_<arch>() + one dispatch line. See README for
the per-arch checklist.
Verified end-to-end: `llama-server --model BLOB --mmproj BLOB` with an
Ollama gemma3:latest blob answers both text prompts ("Paris") and vision
prompts (correct image descriptions).
Remove the vendored GGML and llama.cpp backend, CGO runner, Go model
implementations, and sample. llama-server (built from upstream llama.cpp via
FetchContent) is now the sole inference engine for GGUF-based models.
(Safetensors-based models continue to run on the new MLX engine.) This lets
us pick up new capabilities and fixes from llama.cpp more rapidly as they
land.
On Windows this now requires recent AMD drivers that support ROCm v7, since
llama.cpp currently does not support building against v6.