ollama/llm
jmorganca db0c745308 llama/compat: add qwen35moe vision (clip) support
Extends the compat layer with the vision side for Ollama's monolithic
qwen3.5 blobs. All changes in llama/compat/ — no new upstream patch edits.

New generic infra (reused by gemma3's existing promotion):
  - LoadOp registry (g_loadops). Any dest tensor whose name is registered
    gets its bytes produced by a closure instead of being read straight
    from disk. maybe_load_tensor consults it.
  - promote_tensor_to_f32(meta, ctx, name) now captures the source offset
    at registration time and becomes a LoadOp. Gemma3 already migrated.
  - register_concat_load(meta, dest, {srcs...}) captures the file offsets
    of N source tensors and registers a LoadOp that concatenates them.
    Assumes sources concatenate along their slowest ggml axis — which in
    C order means the dest bytes are src[0] || src[1] || ... .
  - set_tensor_shape / set_tensor_type helpers for in-place edits.
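The registry and concat registration can be sketched roughly like this (illustrative only: SrcSpan and load_tensor_bytes are stand-ins for the real gguf offset/read plumbing, not the actual compat-layer signatures):

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

// A LoadOp produces a tensor's bytes instead of a plain disk read.
using LoadOp = std::function<std::vector<uint8_t>()>;
static std::map<std::string, LoadOp> g_loadops;

// A byte range captured at registration time (stand-in for a gguf file offset).
struct SrcSpan { const std::vector<uint8_t> *buf; size_t off, len; };

// Concatenating along the slowest ggml axis is, in C order, a plain
// byte-wise src[0] || src[1] || ... into the dest.
static void register_concat_load(const std::string &dest, std::vector<SrcSpan> srcs) {
    g_loadops[dest] = [srcs]() {
        std::vector<uint8_t> out;
        for (const auto &s : srcs)
            out.insert(out.end(), s.buf->begin() + s.off, s.buf->begin() + s.off + s.len);
        return out;
    };
}

// maybe_load_tensor-style lookup: a registered LoadOp wins over the file read.
static std::vector<uint8_t> load_tensor_bytes(const std::string &name,
                                              const std::vector<uint8_t> &file,
                                              size_t off, size_t len) {
    auto it = g_loadops.find(name);
    if (it != g_loadops.end()) return it->second();
    return {file.begin() + off, file.begin() + off + len};
}
```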

qwen35moe clip handler (handle_qwen35moe_clip):
  - Detection reuses detect_ollama_qwen35moe; additionally requires
    embedded v.* tensors so we don't fire for text-only files.
  - KV synth: clip.vision.* from qwen35moe.vision.* + sensible defaults
    (feed_forward_length=4304, image_size=768, layer_norm_epsilon=1e-6,
    is_deepstack_layers=false[27], image_mean/std=[0.5,0.5,0.5]).
  - Arch rewrite: general.architecture=clip, projector_type=qwen3vl_merger.
  - QKV merge per block (27x): captures q/k/v file offsets, registers a
    concat LoadOp, renames attn_q -> attn_qkv and widens its shape from
    [hidden, hidden] to [hidden, 3*hidden].
  - patch_embed split: source [16,16,2,3456] F16 -> two dests
    [16,16,3,1152] F32, permuting (c_out*3+c_in) packed_c back into
    separate c_in/c_out dims. Matches upstream convert_hf's
    Qwen3VLVisionModel.modify_tensors split.
  - Tensor renames (substring-matched): pos_embed -> position_embd,
    merger.norm -> post_ln, merger.linear_fc1/2 -> mm.0/mm.2,
    mlp.linear_fc1/2 -> ffn_up/ffn_down, norm1/2 -> ln1/ln2.
  - F16 -> F32 promote for v.position_embd.weight.
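The packed_c unpacking behind the patch_embed split can be illustrated on a flat buffer. A minimal sketch assuming ggml's ne0-fastest layout, so src [16,16,2,3456] iterates slowest-to-fastest as [packed_c][t][spatial] and each dest [16,16,3,1152] as [c_out][c_in][spatial]; the real LoadOp also does the F16 -> F32 conversion, elided here:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One dest tensor per temporal slice t in {0, 1}; channels are packed in the
// source as packed_c = c_out*3 + c_in and split into separate dims in the dest.
static std::vector<float> split_patch_embed(const std::vector<float> &src,
                                            size_t hw,     // spatial elems (16*16)
                                            size_t n_in,   // c_in  (3)
                                            size_t n_out,  // c_out (1152)
                                            size_t t) {    // temporal slice
    std::vector<float> dst(n_out * n_in * hw);
    for (size_t co = 0; co < n_out; ++co)
        for (size_t ci = 0; ci < n_in; ++ci)
            for (size_t i = 0; i < hw; ++i) {
                size_t packed_c = co * n_in + ci;  // c_out*3 + c_in
                // src slowest->fastest: [packed_c][t][hw]
                // dst slowest->fastest: [c_out][c_in][hw]
                dst[(co * n_in + ci) * hw + i] = src[(packed_c * 2 + t) * hw + i];
            }
    return dst;
}
```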

Ctx-pool trick for the sibling tensor:
  clip.cpp sizes its ggml_context for exactly the gguf's tensor count
  (+1), so calling ggml_new_tensor to add v.patch_embd.weight.1 would
  overflow the pool. Since
  v.blk.0.attn_k.weight is orphaned after the QKV merge (clip only
  requests the merged attn_qkv), steal that slot: rename it to
  v.patch_embd.weight.1 and reshape to [16,16,3,1152] F32. Its original
  file offset is ignored; the LoadOp we register overrides the read.
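The slot steal amounts to an in-place metadata edit along these lines (struct and field names are illustrative, not the real compat-layer types):

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <string>

// Illustrative stand-in for a gguf tensor-info entry.
struct TensorMeta {
    std::string name;
    std::array<int64_t, 4> ne;  // ggml-style shape, ne0 fastest
    int type;                   // 0 = GGML_TYPE_F32, 1 = GGML_TYPE_F16
};

// Repurpose the orphaned attn_k entry as the second patch_embd dest so
// clip.cpp's tensor budget is not exceeded. Its bytes come from the
// registered LoadOp, so the stale file offset is never read.
static void steal_slot_for_patch_embed(TensorMeta &orphan) {
    orphan.name = "v.patch_embd.weight.1";
    orphan.ne   = {16, 16, 3, 1152};
    orphan.type = 0;  // GGML_TYPE_F32
}
```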

Go side: adds qwen35moe to the auto-mmproj arch allowlist. ollama now
passes the monolithic blob as both --model and --mmproj for qwen3.5.

Verified end-to-end: ollama run qwen3.5:35b-a3b-q4_K_M with an image
correctly describes the image ("screenshot of a chat interface...
'open the browser, open never gonna give you up on youtube'..."). Text
inference still works on the same blob.
2026-04-20 09:29:34 -07:00