Adds support for two QwenVL families that ship as monolithic vision+text
Ollama GGUFs.
qwen2.5-vl (text + clip):
* Text: arch translation qwen25vl → qwen2vl, KV prefix copy, and
mrope_section (3 elements) → rope.dimension_sections (4 elements, padded
with 0); see the KV sketch after this list.
* Clip: tensor renames (v.merger.* → v.post_ln/mm.0/mm.2,
v.patch_embd_{0,1} → v.patch_embd.weight{,.1}), use_silu, derive
n_wa_pattern from fullatt_block_indexes[0]+1, image_size=560 default,
standard CLIP image_mean/std, and F32 promotion of patch_embd for Metal.
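
A rough sketch of the text-side KV translation described above. The exact
key names and the map-of-int-arrays representation are assumptions for
illustration, not the actual converter code:

```cpp
// Sketch only: copy qwen25vl.* KV pairs to the qwen2vl.* prefix and pad the
// 3-element mrope_section into 4-element rope.dimension_sections.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using kv_arrays = std::map<std::string, std::vector<int32_t>>;

static kv_arrays translate_qwen25vl_text_kv(const kv_arrays & src) {
    kv_arrays dst;
    const std::string old_prefix = "qwen25vl.";
    for (const auto & [key, val] : src) {
        if (key.rfind(old_prefix, 0) == 0) {
            dst["qwen2vl." + key.substr(old_prefix.size())] = val; // prefix copy
        }
    }
    // mrope_section ships with 3 entries; the qwen2vl arch wants 4, padded with 0.
    auto sections = src.at(old_prefix + "rope.mrope_section"); // assumed key name
    sections.resize(4, 0);
    dst["qwen2vl.rope.dimension_sections"] = sections;
    return dst;
}
```
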
qwen3-vl (text + clip):
* Text: inject qwen3vl.rope.dimension_sections=[24,20,20,0] (the
Qwen3-VL-8B HF default; head_dim=128, so the sections sum to 64 =
head_dim/2) and qwen3vl.n_deepstack_layers derived from the length of
the deepstack_visual_indexes array.
* Clip: per-block QKV merge (Ollama stores separate Q/K/V; upstream
qwen3vl_merger graph reads combined attn_qkv); deepstack remap
v.deepstack_merger.X.* → v.deepstack.{indexes[X]}.* (with
linear_fc{1,2} → fc{1,2}; see the rename sketch after this list);
merger renames matching upstream LLaVA
proj layout (v.post_ln + mm.0/mm.2); patch_embed split (16x16x2
Conv3D → two 16x16 Conv2Ds, F16→F32); per-block substring renames
(norm1/2 → ln1/2, mlp.linear_fc1/2 → ffn_up/down); F32 promotion of
position_embd.
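
The deepstack and per-block renames could look roughly like the following
sketch; the function names are invented and the substring matching is
simplified relative to what the converter actually does:

```cpp
// Sketch only. indexes = deepstack_visual_indexes from the model metadata.
#include <string>
#include <vector>

static void replace_all(std::string & s, const std::string & from, const std::string & to) {
    for (size_t pos = 0; (pos = s.find(from, pos)) != std::string::npos; pos += to.size()) {
        s.replace(pos, from.size(), to);
    }
}

static std::string remap_qwen3vl_vision(std::string name, const std::vector<int> & indexes) {
    const std::string prefix = "v.deepstack_merger.";
    size_t dot;
    if (name.rfind(prefix, 0) == 0 &&
        (dot = name.find('.', prefix.size())) != std::string::npos) {
        // v.deepstack_merger.X.rest -> v.deepstack.{indexes[X]}.rest
        const int x = std::stoi(name.substr(prefix.size(), dot - prefix.size()));
        name = "v.deepstack." + std::to_string(indexes.at(x)) + name.substr(dot);
        replace_all(name, "linear_fc1", "fc1");
        replace_all(name, "linear_fc2", "fc2");
        return name;
    }
    // per-block substring renames on the vision tower
    replace_all(name, "mlp.linear_fc1", "ffn_up");
    replace_all(name, "mlp.linear_fc2", "ffn_down");
    replace_all(name, "norm1", "ln1");
    replace_all(name, "norm2", "ln2");
    return name;
}
```
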
The qwen3-vl QKV merge needed a new util helper
(register_concat_load_to_f32) because the Ollama blob mixes types within
a single block (F16 Q/K plus Q8_0 V), and the existing byte-concat
register_concat_load only works when all sources share a type. The new
helper dequantizes each source to F32 via ggml_get_type_traits->to_float
and concatenates the results; the caller sets the destination tensor
type to F32.
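
A compilable sketch of the dequantize-and-concatenate step only;
everything except the ggml calls is a placeholder, and the real helper
presumably wires this into the loader's deferred-load registration:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

#include "ggml.h"

struct src_view {
    ggml_type    type;    // e.g. GGML_TYPE_F16 for Q/K, GGML_TYPE_Q8_0 for V
    const void * data;    // raw source bytes from the Ollama blob
    int64_t      n_elem;  // element count (a multiple of the type's block size)
};

// Concatenate sources of mixed types into a single F32 buffer.
static std::vector<float> concat_sources_to_f32(const std::vector<src_view> & srcs) {
    std::vector<float> out;
    for (const auto & s : srcs) {
        std::vector<float> tmp(s.n_elem);
        const ggml_type_traits * traits = ggml_get_type_traits(s.type);
        if (traits->to_float) {
            traits->to_float(s.data, tmp.data(), s.n_elem); // F16/Q8_0/... -> F32
        } else {
            // types without a to_float routine (plain F32) are copied through
            std::copy_n(static_cast<const float *>(s.data), s.n_elem, tmp.begin());
        }
        out.insert(out.end(), tmp.begin(), tmp.end());
    }
    return out; // caller stores this as an F32 destination tensor
}
```
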
Verified end-to-end via `ollama run`:
* qwen2.5vl: image of NYT moon-landing front page → correct caption.
* qwen3-vl: same image → "New York Times front page from July 21,
1969, headlines the Apollo 11 moon landing with 'MEN WALK ON MOON'
and details of astronauts collecting rocks and planting a flag."
Also added qwen25vl + qwen3vl to the compatClipArches allowlist in
llm/llama_server.go so the auto-mmproj path activates for these
monolithic blobs (same mechanism as gemma3/4/qwen35moe/etc.).