Two reductions:
1. Drop the gguf_rename_tensor forwarder from gguf.h/gguf.cpp.
The rename-in-place trick it does (calling ggml_set_name on an embedded
ggml_tensor) can be done from outside gguf.cpp via:
    char * p = const_cast<char *>(gguf_get_tensor_name(meta, id));
    strncpy(p, new_name, GGML_MAX_NAME - 1);
    p[GGML_MAX_NAME - 1] = '\0'; // strncpy does not null-terminate on truncation
That pointer points into a mutable char[GGML_MAX_NAME] inside a std::vector
element; the const on the return type is API courtesy only. Writing through
it is defined behavior and has no struct-layout dependency.
2. Drop the src/CMakeLists.txt hunk that added llama-ollama-compat.cpp to
the llama target. Replace with a target_sources() call in Ollama's
llama/server/CMakeLists.txt after FetchContent_MakeAvailable. Our
compat files now stay in llama/compat/ and are never copied into the
fetched _deps/ tree.
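A minimal sketch of that target_sources() hookup, assuming the fetched
content is named llama_cpp and the library target is llama (both names are
assumptions, not the actual Ollama tree):

```cmake
# Hypothetical fragment for llama/server/CMakeLists.txt.
# Target and path names are illustrative.
FetchContent_MakeAvailable(llama_cpp)

# Attach the compat shim to the fetched llama target; the source file
# stays in llama/compat/ and is never copied into the _deps/ tree.
target_sources(llama PRIVATE
    ${CMAKE_CURRENT_SOURCE_DIR}/../compat/llama-ollama-compat.cpp)
target_include_directories(llama PRIVATE
    ${CMAKE_CURRENT_SOURCE_DIR}/../compat)
```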
Net patch now touches 3 files, 20 lines, all pure call-site insertions:
src/llama-model-loader.cpp +8 (include + translate + 2x should_skip)
src/llama-model.cpp +4 (include + apply_tensor_transforms)
tools/mtmd/clip.cpp +8 (include + translate_clip + maybe_load)
Verified: fresh build from scratch (rm -rf build && cmake configure)
runs PATCH_COMMAND cleanly, compiles, and ollama run gemma3 still works
end-to-end for text + vision.
Older Ollama builds ship GGUFs that diverge slightly from upstream llama.cpp
in arch names, KV keys, tensor names, and (for vision models) file layout
(text+vision in one monolithic file). This adds a self-contained compat
layer that translates those files in memory at load time, so
~/.ollama/models/blobs/* can be served by upstream llama-server with no
re-conversion and no re-download.
Structure:
    llama/compat/
        llama-ollama-compat.{h,cpp} — the shim (Ollama-owned, ~500 LOC)
        upstream-edits.patch — ~48 lines of call-site hooks in 6 upstream files
        compat.cmake — include()-able CMake fragment
        README.md — what/why/how-to-regen
Integration: llama/server/CMakeLists.txt includes compat.cmake and passes
OLLAMA_LLAMA_CPP_COMPAT_PATCH_COMMAND to FetchContent_Declare via
PATCH_COMMAND. When OLLAMA_LLAMA_CPP_SOURCE is set (dev mode), the patch is
skipped so the developer's tree stays untouched.
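A rough sketch of that wiring, assuming compat.cmake defines
OLLAMA_LLAMA_CPP_COMPAT_PATCH_COMMAND (repository URL, content name, and the
variable-expansion trick are all illustrative):

```cmake
# Hypothetical sketch of llama/server/CMakeLists.txt integration.
include(${CMAKE_CURRENT_SOURCE_DIR}/../compat/compat.cmake)

if(NOT OLLAMA_LLAMA_CPP_SOURCE)
    # Apply the compat hooks only to the fetched tree, never to a
    # developer-provided checkout (dev mode).
    set(_patch_cmd PATCH_COMMAND ${OLLAMA_LLAMA_CPP_COMPAT_PATCH_COMMAND})
endif()

FetchContent_Declare(llama_cpp
    GIT_REPOSITORY https://github.com/ggml-org/llama.cpp
    GIT_TAG        master   # a pinned tag in the real build
    ${_patch_cmd})
FetchContent_MakeAvailable(llama_cpp)
```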
Currently handles gemma3 (text + vision). Pattern is data-driven — adding
other archs is a new handle_<arch>() + one dispatch line. See README for
the per-arch checklist.
Verified end-to-end: `llama-server --model BLOB --mmproj BLOB` with an
Ollama gemma3:latest blob answers both text prompts ("Paris") and vision
prompts (correct image descriptions).
Remove the vendored GGML and llama.cpp backend, CGO runner, Go model
implementations, and sample. llama-server (built from upstream llama.cpp via
FetchContent) is now the sole inference engine for GGUF-based models.
(Safetensors-based models continue to run on the new MLX engine.) This lets
us pick up new capabilities and fixes from llama.cpp more rapidly as they
land.
On Windows this now requires recent AMD drivers that support ROCm v7, since
llama.cpp currently does not support building against v6.