Older Ollama builds ship GGUFs that diverge slightly from upstream llama.cpp
in arch names, KV keys, tensor names, and (for vision models) file layout
(text+vision in one monolithic file). This adds a self-contained compat
layer that translates those files in memory at load time, so
~/.ollama/models/blobs/* can be served by upstream llama-server with no
re-conversion and no re-download.
Structure:
llama/compat/
llama-ollama-compat.{h,cpp} — the shim (Ollama-owned, ~500 LOC)
upstream-edits.patch — ~48 lines of call-site hooks in 6 upstream files
compat.cmake — include()-able CMake fragment
README.md — what/why/how-to-regen
Integration: llama/server/CMakeLists.txt includes compat.cmake and passes
OLLAMA_LLAMA_CPP_COMPAT_PATCH_COMMAND to FetchContent_Declare via
PATCH_COMMAND. When OLLAMA_LLAMA_CPP_SOURCE is set (dev mode), the patch is
skipped so the developer's tree stays untouched.
Currently handles gemma3 (text + vision). Pattern is data-driven — adding
other archs is a new handle_<arch>() + one dispatch line. See README for
the per-arch checklist.
Verified end-to-end: `llama-server --model BLOB --mmproj BLOB` with an
Ollama gemma3:latest blob answers both text prompts ("Paris") and vision
prompts (correct image descriptions).