Older Ollama builds ship GGUFs that diverge slightly from upstream llama.cpp
in arch names, KV keys, tensor names, and (for vision models) file layout
(text+vision in one monolithic file). This adds a self-contained compat
layer that translates those files in memory at load time, so
~/.ollama/models/blobs/* can be served by upstream llama-server with no
re-conversion and no re-download.
Structure:

  llama/compat/
    llama-ollama-compat.{h,cpp} — the shim (Ollama-owned, ~500 LOC)
    upstream-edits.patch — ~48 lines of call-site hooks in 6 upstream files
    compat.cmake — include()-able CMake fragment
    README.md — what/why/how-to-regen
Integration: llama/server/CMakeLists.txt includes compat.cmake and passes
OLLAMA_LLAMA_CPP_COMPAT_PATCH_COMMAND to FetchContent_Declare via
PATCH_COMMAND. When OLLAMA_LLAMA_CPP_SOURCE is set (dev mode), the patch is
skipped so the developer's tree stays untouched.
Currently handles gemma3 (text + vision). Pattern is data-driven — adding
other archs is a new handle_<arch>() + one dispatch line. See README for
the per-arch checklist.
Verified end-to-end: `llama-server --model BLOB --mmproj BLOB` with an
Ollama gemma3:latest blob answers both text prompts ("Paris") and vision
prompts (correct image descriptions).
# llama.cpp compatibility shim

This directory holds an in-process compatibility layer that lets upstream
`llama-server` load GGUFs produced by older versions of Ollama (and files
pulled from the Ollama registry) without re-converting or re-downloading.

The layer is applied automatically at build time via CMake `FetchContent`'s
`PATCH_COMMAND` — there is no separate "apply patches" step.

## Files

- `llama-ollama-compat.h`, `llama-ollama-compat.cpp` — the shim itself. These
  are regular source files owned by Ollama; they get copied into the fetched
  llama.cpp source tree during configure.
- `upstream-edits.patch` — small additive edits to upstream files so the
  shim gets called. Currently ~48 lines touching 6 files. Kept as a real
  `git` patch so re-generation on upstream bumps is one command.

## What the shim does

The shim runs at two well-defined points in the loader:

1. **After `gguf_init_from_file`**, for both the main model loader and the
   `mtmd/clip` loader: inspects the just-parsed metadata and decides whether
   the file is an Ollama-format GGUF. If so, it mutates the in-memory
   `gguf_context` and `ggml_context` (KV names, tensor names, tensor types)
   so the rest of the loader sees an upstream-shape file.

2. **After `load_all_data`**: applies any numerical fix-ups that need the
   tensors in their final backend buffers (e.g. RMSNorm `+1` if a future
   arch needs it — gemma3 doesn't).

Non-Ollama files are detected by the absence of both Ollama-specific KV keys
(e.g. `gemma3.mm.tokens_per_image`) and embedded `v.*` / `mm.*` tensors in
the main model file. When no markers are present, every compat function is
an immediate no-op.

## Currently supported architectures

| Arch | Text loader | Clip (mmproj) loader |
|---|---|---|
| `gemma3` | KV injection (`layer_norm_rms_epsilon`, `rope.freq_base`, `rope.freq_base_swa`), tokenizer vocab truncation, drop `v.*`/`mm.*` tensors | Arch rewrite to `clip`, KV synthesis (`clip.vision.*`, `clip.projector_type=gemma3`), tensor renames (`v.patch_embedding`→`v.patch_embd`, `mlp.fc{1,2}`→`ffn_{down,up}`, etc.), F16→F32 promotion for patch/position embeddings (Metal IM2COL requirement) |

Usage:

```
llama-server --model /path/to/ollama-blob --mmproj /path/to/ollama-blob
```


Passing the same monolithic GGUF as both `--model` and `--mmproj` works —
each loader applies its own translation.

Additional architectures are added by implementing a `handle_<arch>()`
and (for vision models) a `handle_<arch>_clip()` in `llama-ollama-compat.cpp`
and dispatching them from `translate_metadata` / `translate_clip_metadata`.

## Regenerating `upstream-edits.patch`

After upstream changes the insertion points (rare), re-apply the edits to
a fresh checkout and run:

```
cd /path/to/llama.cpp
git diff -- \
  ggml/include/gguf.h \
  ggml/src/gguf.cpp \
  src/CMakeLists.txt \
  src/llama-model-loader.cpp \
  src/llama-model.cpp \
  tools/mtmd/clip.cpp \
  > /path/to/ollama/llama/compat/upstream-edits.patch
```

## Why not fork llama.cpp or vendor it?

Forking means tracking upstream manually. Vendoring means snapshotting all of
llama.cpp's source in the Ollama tree (the old `llama/llama.cpp/` layout).
This shim keeps upstream unmodified on disk and the Ollama-specific logic
isolated in two files plus a small diff — upstream bumps are usually just
`LLAMA_CPP_VERSION` changes.