ollama

mirror of https://github.com/ollama/ollama.git synced 2026-04-17 21:54:08 +02:00

Author	SHA1	Message	Date
Daniel Hiltgen	de9673ac3f	tokenizer: add byte fallback for SentencePiece BPE encoding (#15232 ) * tokenizer: add byte fallback for SentencePiece BPE encoding When BPE merging produces tokens not in the vocabulary, fall back to encoding each UTF-8 byte as <0xHH> byte tokens instead of silently dropping the character. Also teach Decode to convert <0xHH> tokens back to raw bytes. Fixes #15229, fixes #15231 * tokenizer fixes	2026-04-02 13:04:45 -07:00
Daniel Hiltgen	96b202d34b	Add support for gemma4 (#15214 ) * bench: add prompt calibration, context size flag, and NumCtx reporting Add --num-ctx flag to set context size, and report NumCtx in model info header. Calibrate tokens-per-word ratio during warmup using actual tokenization metrics from the model, replacing the fixed 1.3 heuristic. This produces more accurate prompt token counts for --prompt-tokens. Also add fetchContextLength() to query running model context via /api/ps. * integration: improve vision test robustness and add thinking tests Add skipIfNoVisionOverride() to skip vision tests when OLLAMA_TEST_MODEL is set to a non-vision model. Add Think:false to context exhaustion test to prevent thinking models from using all context before the test can measure it. Add third test image (ollama homepage) and replace OCR test with ImageDescription test using it. Relax match strings for broader model compatibility. Add TestThinkingEnabled and TestThinkingSuppressed to verify thinking output and channel tag handling. * gemma4: add Gemma 4 GGML model support Add full Gemma 4 model family support (E2B, E4B, 26B MoE, 31B Dense) for the GGML backend including text, vision, converter, parser, and renderer. 
Text model features: - Sliding window + full attention with per-layer patterns - KV sharing across layers with donor map - Per-layer embeddings (PLE) with learned projections - MoE routing with RMSNorm + learned scale - Proportional RoPE with freq_factors for global attention - Final logit softcapping Vision model features: - SigLIP vision encoder with 2D RoPE - ClippableLinear with input/output clamping via packed v.clamp_data - Adaptive average pooling with nMerge kernel - Multi-modal projection with unweighted RMSNorm Converter: - Safetensors to GGUF with vision tensor renaming - Fused MoE gate_up_proj splitting - Vision patch embedding reshape (HF to Conv2D layout) - Packed clamp data tensor for ClippableLinear bounds - Proportional RoPE freq_factors generation Also includes: - BackendGet() on ml.Tensor for reading weight tensor data - Q6_K CUDA get_rows kernel support - MoE-aware ffn_down quantization layer counting - Gemma4 parser with tool calling and thinking support - Gemma4 renderer with structured tool format - Architecture-based auto-detection of renderer/parser/stop tokens - Integration test gemma4 model list additions * gemma4: add audio support with USM conformer encoder Add audio encoding for Gemma 4 using the USM conformer architecture: - Converter: audio tensor mapping, SSCP/conformer/embedder name replacements, softplus repacker for per_dim_scale, F32 enforcement for conv weights - GGML backend: Conv1DDW and PadExt tensor ops - Audio encoder: SSCP Conv2D, 12 conformer blocks (FFW + block-local attention with relative position embeddings + LightConv1d + FFW), output projection, audio-to-text embedding projector - Audio preprocessing: WAV decode, mel spectrogram, FFT (pure Go) - Model wiring: WAV detection, audio token handling, unified PostTokenize Correctly transcribes "why is the sky blue" from test audio. 
* integration: add gemma4 audio tests including OpenAI API coverage Test audio transcription and response via the Ollama native API, plus two new tests exercising the OpenAI-compatible endpoints: - /v1/audio/transcriptions (multipart form upload) - /v1/chat/completions with input_audio content type All tests use capability checks and skip models without audio support. * gemma4: add OpenAI audio API support and capability detection - Add CapabilityAudio and detect from audio.block_count in GGUF - Add /v1/audio/transcriptions endpoint with TranscriptionMiddleware - Add input_audio content type support in /v1/chat/completions - Add TranscriptionRequest/Response types in openai package * gemma4: add audio input support for run command - /audio toggle in interactive mode for voice chat - Platform-specific microphone recording (AVFoundation on macOS, PulseAudio/ALSA on Linux, WASAPI on Windows) - Space to start/stop recording, automatic chunking for long audio * gemma4: add transcribe command (ollama transcribe MODEL) - Interactive mode with readline prompt and slash commands - Non-interactive mode for piped audio or record-until-Ctrl+C - Chunked streaming transcription for long recordings - Word-wrapped output matching run command style * gemma4: add parser, renderer, and integration test plumbing * gemma4: fix renderer to emit BOS token * gemma4: add OpenAI audio transcription API and input_audio support * gemma4: update converter for new weight drop naming * gemma4: add per_expert_scale to MoE router and fix moe_intermediate_size config * gemma4: rewrite renderer to match HF Jinja2 template exactly Fix 8 bugs found by building 55 reference tests verified against the HF Jinja2 chat template (VERIFY_JINJA2=1 shells out to Python): - Tool responses use separate <|turn>tool turns (not inline tags) - Tool calls emitted before content in assistant messages - Thinking content stripped from assistant history (strip_thinking) - User, tool, and system content trimmed (template does | trim) - Empty system message still emits system turn (check role, not content) - Nested object properties rendered recursively with required field - Array items specification rendered for array-type properties - OBJECT/ARRAY type-specific rendering comma logic matches template Also adds Required field to api.ToolProperty for nested object schemas, replaces old gemma4_test.go with comprehensive gemma4_reference_test.go, and commits the Jinja2 template as testdata for verification. * gemma4: fix MoE fused gate_up split and multiline tool-call arg parsing - Text MoE: split `ffn_gate_up_exps` into contiguous `[gate|up]` halves instead of stride-2 slices. - Parser: escape control characters in `<|"|>...<|"|>` string literals when converting tool-call args to JSON. - Fixes warnings like `invalid character '\n' in string literal` for multiline tool arguments. - Add Gemma4 parser regressions for multiline tool-call args and `gemma4ArgsToJSON`. * cmd: simplify audio input to dropped file attachments * gemma4: use full SWA memory for better cache reuse * gemma4: initialize clamps after backend load * convert: align gemma4 audio tensor renames with llama.cpp * Remove redundant comments in gemma4 vision model * Format Gemma4 MoE block field alignment * use 4096 kvcache.NewSWAMemCache * convert: support new Gemma4 audio_tower tensor naming (#15221) Co-authored-by: jmorganca <jmorganca@gmail.com> * fix integration test defaults for audio * review comments and lint fixes * remove unused audio/video files --------- Co-authored-by: jmorganca <jmorganca@gmail.com>	2026-04-02 11:33:33 -07:00
Jeffrey Morgan	b7bda92d52	model: add qwen3-next compatibility for legacy ssm_in projections (#15133 )	2026-03-29 11:50:47 -07:00
Jeffrey Morgan	3490e9590b	model/qwen3next: avoid crash in DeltaNet when offloading (#14541 ) Co-authored-by: Yossi Ovadia <jabadia@gmail.com>	2026-03-01 18:44:04 -08:00
Jeffrey Morgan	8da09b1e7e	qwen3next: add compatibility with imported GGUF models (#14517 )	2026-02-28 14:21:42 -08:00
Jeffrey Morgan	7f9efd53df	model: add support for qwen3.5-27b model (#14415 )	2026-02-25 01:09:58 -08:00
Jeffrey Morgan	da70c3222e	model: support for qwen3.5 architecture (#14378 )	2026-02-24 20:08:05 -08:00
Jeffrey Morgan	4b2ac1f369	model: improvements to LFM architectures (#14368 )	2026-02-23 14:38:10 -08:00
Jeffrey Morgan	0ade9205cc	models: add nemotronh architecture support (#14356 )	2026-02-22 15:09:14 -08:00
Michael Yang	f1373193dc	move tokenizers to separate package (#13825 )	2026-02-05 17:44:11 -08:00
Jeffrey Morgan	d25535c3f3	qwen3next: avoid inplace sigmoid for shared gate (#14077 )	2026-02-04 15:50:02 -08:00
Jeffrey Morgan	255579aaa7	qwen3next: fix issue in delta net (#14075 ) gDiffExp was being broadcast across the wrong axis when multiplying with k. This fix reshapes gDiffExp to [1, chunkSize, nChunks, ...]	2026-02-04 13:40:38 -08:00
Jeffrey Morgan	77eb2ca619	model: add qwen3-next architecture (#14051 )	2026-02-03 23:27:21 -08:00
Jeffrey Morgan	8f4a008139	Add GLM-OCR vision model support (#14024 )	2026-02-02 15:39:18 -08:00
Jeffrey Morgan	a1ca428c90	glm4moelite: fix attention scale calculation (#13893 ) Use the original key dimension (qkNopeHeadDim + qkRopeHeadDim = 256) for the attention scale instead of the MLA absorbed dimension (kvLoraRank + qkRopeHeadDim = 576). MLA absorption is a mathematically equivalent reorganization of the attention computation - it should not change the effective attention scale. The scale should match training, which uses 1/sqrt(256). This improves tool calling and model looping issues.	2026-01-24 17:48:09 -08:00
Jeffrey Morgan	16750865d1	glm4moelite: quantize more tensors to q8_0 and avoid double BOS token (#13891 )	2026-01-24 16:33:54 -08:00
Jeffrey Morgan	64737330a4	Re-apply "model: add MLA absorption for glm4moelite" with fix (#13870 ) The nvidia_fp32 config for (576, 512) head sizes had nbatch_fa=32, which caused zero-sized arrays when computing array dimensions: nbatch_fa / (np * warp_size) = 32 / (2 * 32) = 0 This resulted in CUDA compilation failures on CUDA 12 (Windows and Linux arm64): - "static assertion failed with nbatch_fa % (np*warp_size) != 0" - "the size of an array must be greater than zero" Fix by changing nbatch_fa from 32 to 64 for all (576, 512) configs in the nvidia_fp32 function, matching the nvidia_fp16 and AMD configs.	2026-01-23 18:40:28 -08:00
Jeffrey Morgan	2eda97f1c3	Revert "model: add MLA absorption for glm4moelite (#13810 )" (#13869 ) This reverts commit `1044b0419a`.	2026-01-23 17:14:15 -08:00
Jeffrey Morgan	1044b0419a	model: add MLA absorption for glm4moelite (#13810 ) * model: add MLA absorption for glm4moelite Split the combined KV_B tensor into separate K_B and V_B tensors during conversion, enabling MLA (Multi-head Latent Attention) absorption which compresses the KV cache for improved efficiency. * ggml: enable MLA flash attention for GLM-4.7-flash Add support for gqa_ratio 4 in MLA flash attention kernels. GLM-4.7-flash uses head size 576 with gqa_ratio 4, which was previously only supported for gqa_ratio 16 (DeepSeek). Metal changes: - Enable head size 576 for flash attention - Increase simdgroups to 8 for large heads (>=512) - Add case 8 kernel dispatch for 8 simdgroups CUDA changes: - Add gqa_ratio 4 support for head 576/512 - Add tile configs for (576, 512, 4) and (576, 512, 8) - Add MMA config cases for ncols 4 - Add template instances for ncols2=4 * model: add compatibility validation for glm4moelite architecture	2026-01-23 14:47:42 -08:00
Jeffrey Morgan	01cf7445f3	model: add lfm2 architecture and LFM2.5-1.2B-Thinking support (#13792 ) Co-Authored-By: TommyBoiss <165361500+TommyBoiss@users.noreply.github.com>	2026-01-20 12:20:53 -08:00
Jeffrey Morgan	4f138a1749	model: add `Glm4MoeLiteForCausalLM` architecture to support GLM-4.7-Flash (#13779 )	2026-01-19 12:47:17 -08:00
Michael Yang	f6a016f49d	revert granite-embedding (#13505 )	2025-12-16 15:44:52 -08:00
Michael Yang	903b1fc97f	use ollama engine for bert models (#13501 ) register bpe tokenizer which enables granite-embedding	2025-12-16 11:29:19 -08:00
Michael Yang	971d62595a	fix: qwen2.5 vl rope (#13486 ) * qwen25vl: bump max pixels * qwen25vl: mrope fix qwen2.5vl window * qwen25vl: vision rope	2025-12-15 17:30:33 -08:00
Parth Sareen	ffbe8e076d	model: add olmo3 and olmo3.1 (#13415 )	2025-12-15 15:20:04 -08:00
Jeffrey Morgan	4ff8a691bc	model: default gemma 3 rope scale to 1.0, apply corrections based on layer counts (#13453 )	2025-12-12 17:51:56 -08:00
Jeffrey Morgan	1b308e1d2a	model: fix global layer rope scale values for gemma 3 (#13452 )	2025-12-12 16:29:01 -08:00
Jeffrey Morgan	3af5d3b738	model: force rope factor 1.0 for Gemma 3 (#13445 )	2025-12-12 13:27:08 -08:00
Jeffrey Morgan	2dfb74410d	model: fix rotary embeddings for ministral 3 (#13432 )	2025-12-11 16:02:05 -08:00
Jeffrey Morgan	a838421ea3	model: conversion and hyperparameter fixes for ministral and devstral (#13424 )	2025-12-11 13:04:00 -08:00
nicole pardal	76f88caf43	nomic-embed-text:v2: model implementation (#13162 )	2025-12-09 14:24:51 -08:00
Jeffrey Morgan	d2f334c1f7	model: add rnj-1 inference support (#13354 )	2025-12-08 16:49:17 -08:00
Michael Yang	603ceefaa6	refactor rope: change to a flatter directory structure and group the options with the function; update models to call rope in one place	2025-12-08 14:42:22 -08:00
Patrick Devine	d3e0a0dee4	model: ministral w/ llama4 scaling (#13292 ) This change: * fixes rope scaling in the mistral converter * updates ministral to include llama4 scaling * includes a new ministral parser for parsing reasoning and tool calling --------- Co-authored-by: jmorganca <jmorganca@gmail.com>	2025-12-01 23:20:14 -08:00
Michael Yang	5c1063df7f	deepseek2: upgrade to run v3+ models (#13166 ) the check for mla omits v3 and r1 which should not return unsupported. instead check the tokenizer for compatibility	2025-11-19 17:05:39 -08:00
Patrick Devine	604e43b28d	models: enable deepseek2 (deepseek v3.1 w/ MLA) on the new engine (#13151 )	2025-11-18 22:03:50 -08:00
nicole pardal	8de30b568a	nomic-embed-text model implementation (#13071 )	2025-11-18 18:28:10 -08:00
Michael Yang	92981ae3f2	deepseekocr	2025-11-18 16:11:37 -08:00
Grace	584e2d646f	Add deepseek v3.1 (#13063 ) * Add mla for flash attention * Revert to using chunks	2025-11-17 18:03:21 -08:00
Michael Yang	333203d871	chore: update models to use slice/chunk/chunksections (#12934 ) * use slice/chunks * bert * llama4 * gemma3n * gptoss * mistral3 * qwen3vl * qwen25vl * deepseek2 * remove unused ops	2025-11-13 15:20:12 -08:00
Daniel Hiltgen	544b6739dd	ggml update to b6840 (#12791 )	2025-11-06 10:19:22 -08:00
Michael Yang	ce3eb0a315	chore(gptoss): cleanup dead code (#12932 )	2025-11-03 11:27:15 -08:00
Michael Yang	f67a6df110	interleaved mrope (#12807 ) * ml(ggml): mrope * interleave mrope	2025-10-30 11:29:00 -07:00
Michael Yang	d432ade714	fix: qwen2.5vl, qwen3vl composite image (#12841 ) this change fixes images with an alpha channel by overlaying the image onto a white background	2025-10-30 10:33:19 -07:00
Michael Yang	7d25b9e194	feat(model): add qwen3vl (#12665 )	2025-10-28 17:39:47 -07:00
Michael Yang	1188f408dd	s/FromSlice/Froms/ (#12255 )	2025-10-28 12:08:49 -07:00
Michael Yang	ec9eb28f4c	gemma3: make embedding non-causal (#12297 )	2025-10-27 19:54:08 -07:00
Daniel Hiltgen	bc1a818fdc	contiguous input per layer (#12686 ) Co-authored-by: Michael Yang <git@mxy.ng>	2025-10-17 18:39:18 -07:00
Michael Yang	6c833d5f8d	fix(qwen3): deepseek distill deepseek's qwen3 distill uses a different rope scheme so support both	2025-10-13 13:30:30 -07:00
shengxinjing	47298fce39	refactor: use builtin max and min	2025-10-09 16:17:52 -07:00
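
The byte fallback described in commit de9673ac3f can be sketched roughly as follows. This is a minimal illustration and not the code from the ollama tokenizer package: the helper names (byteFallback, encodePiece, decode) and the set-style vocabulary are hypothetical, and only the <0xHH> token format is taken from the commit message.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// byteFallback returns one "<0xHH>" token per UTF-8 byte of piece.
// (Hypothetical helper; the name is not from the ollama source.)
func byteFallback(piece string) []string {
	tokens := make([]string, 0, len(piece))
	for i := 0; i < len(piece); i++ {
		tokens = append(tokens, fmt.Sprintf("<0x%02X>", piece[i]))
	}
	return tokens
}

// encodePiece keeps piece when the vocabulary contains it and otherwise
// falls back to byte tokens instead of dropping the character.
// The vocabulary is modeled as a simple set, which is an assumption.
func encodePiece(vocab map[string]bool, piece string) []string {
	if vocab[piece] {
		return []string{piece}
	}
	return byteFallback(piece)
}

// decode concatenates tokens, converting "<0xHH>" tokens back to raw bytes
// and copying every other token through unchanged.
func decode(tokens []string) string {
	var sb strings.Builder
	for _, t := range tokens {
		if len(t) == 6 && strings.HasPrefix(t, "<0x") && strings.HasSuffix(t, ">") {
			if b, err := strconv.ParseUint(t[3:5], 16, 8); err == nil {
				sb.WriteByte(byte(b))
				continue
			}
		}
		sb.WriteString(t)
	}
	return sb.String()
}

func main() {
	vocab := map[string]bool{"hi": true}
	// "hi" is in the vocabulary; "\u00e9" is not, so it becomes two byte tokens.
	tokens := append(encodePiece(vocab, "hi"), encodePiece(vocab, "\u00e9")...)
	fmt.Println(tokens)
	fmt.Println(decode(tokens))
}
```

A real tokenizer would map these strings to vocabulary IDs and would also need to handle a vocabulary that lacks some byte tokens; both are left out here to keep the sketch short.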

139 Commits