mirror of https://github.com/ollama/ollama.git, synced 2026-04-24 09:46:01 +02:00
Extends the compat layer with the vision side for Ollama's monolithic
qwen3.5 blobs. All changes in llama/compat/ — no new upstream patch edits.
New generic infra (reused by gemma3's existing promotion):
- LoadOp registry (g_loadops). Any dest tensor whose name is registered
gets its bytes produced by a closure instead of being read straight
from disk. maybe_load_tensor consults it.
- promote_tensor_to_f32(meta, ctx, name) now captures the source offset
at registration time and becomes a LoadOp. Gemma3 already migrated.
- register_concat_load(meta, dest, {srcs...}) captures the file offsets
of N source tensors and registers a LoadOp that concatenates them.
Assumes sources concatenate along their slowest ggml axis — which in
C order means the dest bytes are src[0] || src[1] || ... .
- set_tensor_shape / set_tensor_type helpers for in-place edits.
qwen35moe clip handler (handle_qwen35moe_clip):
- Detection reuses detect_ollama_qwen35moe; additionally requires
embedded v.* tensors so we don't fire for text-only files.
- KV synth: clip.vision.* from qwen35moe.vision.* + sensible defaults
(feed_forward_length=4304, image_size=768, layer_norm_epsilon=1e-6,
is_deepstack_layers=false[27], image_mean/std=[0.5,0.5,0.5]).
- Arch rewrite: general.architecture=clip, projector_type=qwen3vl_merger.
- QKV merge per block (27x): captures q/k/v file offsets, registers a
concat LoadOp, renames attn_q -> attn_qkv and widens its shape from
[hidden, hidden] to [hidden, 3*hidden].
- patch_embed split: source [16,16,2,3456] F16 -> two dests
[16,16,3,1152] F32, permuting (c_out*3+c_in) packed_c back into
separate c_in/c_out dims. Matches upstream convert_hf's
Qwen3VLVisionModel.modify_tensors split.
- Tensor renames (substring-matched): pos_embed -> position_embd,
merger.norm -> post_ln, merger.linear_fc1/2 -> mm.0/mm.2,
mlp.linear_fc1/2 -> ffn_up/ffn_down, norm1/2 -> ln1/ln2.
- F16 -> F32 promote for v.position_embd.weight.
Ctx-pool trick for the sibling tensor:
clip.cpp sizes its ggml_context for exactly the gguf's tensor count
(+1), so calling ggml_new_tensor to add v.patch_embd.weight.1 would
overflow the pool. Since
v.blk.0.attn_k.weight is orphaned after the QKV merge (clip only
requests the merged attn_qkv), steal that slot: rename it to
v.patch_embd.weight.1 and reshape to [16,16,3,1152] F32. Its original
file offset is ignored; the LoadOp we register overrides the read.
Go side: adds qwen35moe to the auto-mmproj arch allowlist. ollama now
passes the monolithic blob as both --model and --mmproj for qwen3.5.
Verified end-to-end: ollama run qwen3.5:35b-a3b-q4_K_M with an image
correctly describes the image ("screenshot of a chat interface...
'open the browser, open never gonna give you up on youtube'..."). Text
inference still works on the same blob.