* bench: add prompt calibration, context size flag, and NumCtx reporting
Add --num-ctx flag to set context size, and report NumCtx in model info
header. Calibrate tokens-per-word ratio during warmup using actual
tokenization metrics from the model, replacing the fixed 1.3 heuristic.
This produces more accurate prompt token counts for --prompt-tokens.
Also add fetchContextLength() to query running model context via /api/ps.
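A minimal sketch of the calibration idea, assuming the warmup response reports how many prompt tokens the model actually evaluated. The function names here (tokensPerWord, wordsForTokens) are hypothetical and not taken from the bench tool:

```go
package main

import (
	"fmt"
	"strings"
)

// tokensPerWord estimates the tokens-per-word ratio from a warmup run: the
// prompt token count reported by the model divided by the number of
// whitespace-separated words that were sent. It falls back to the fixed 1.3
// heuristic when no usable measurement is available.
func tokensPerWord(prompt string, promptEvalCount int) float64 {
	words := len(strings.Fields(prompt))
	if words == 0 || promptEvalCount <= 0 {
		return 1.3
	}
	return float64(promptEvalCount) / float64(words)
}

// wordsForTokens returns how many words to generate so that the prompt comes
// out at roughly the requested number of tokens.
func wordsForTokens(target int, ratio float64) int {
	if ratio <= 0 {
		ratio = 1.3
	}
	n := int(float64(target) / ratio)
	if n < 1 {
		n = 1
	}
	return n
}

func main() {
	// Assumed warmup result: a 5-word prompt that tokenized to 10 tokens.
	ratio := tokensPerWord("why is the sky blue", 10)
	fmt.Println(ratio, wordsForTokens(512, ratio))
}
```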
* integration: improve vision test robustness and add thinking tests
Add skipIfNoVisionOverride() to skip vision tests when OLLAMA_TEST_MODEL
is set to a non-vision model. Add Think:false to context exhaustion test
to prevent thinking models from using all context before the test can
measure it. Add third test image (ollama homepage) and replace OCR test
with ImageDescription test using it. Relax match strings for broader
model compatibility. Add TestThinkingEnabled and TestThinkingSuppressed
to verify thinking output and channel tag handling.
* gemma4: add Gemma 4 GGML model support
Add full Gemma 4 model family support (E2B, E4B, 26B MoE, 31B Dense)
for the GGML backend including text, vision, converter, parser, and
renderer.
Text model features:
- Sliding window + full attention with per-layer patterns
- KV sharing across layers with donor map
- Per-layer embeddings (PLE) with learned projections
- MoE routing with RMSNorm + learned scale
- Proportional RoPE with freq_factors for global attention
- Final logit softcapping
Vision model features:
- SigLIP vision encoder with 2D RoPE
- ClippableLinear with input/output clamping via packed v.clamp_data
- Adaptive average pooling with nMerge kernel
- Multi-modal projection with unweighted RMSNorm
Converter:
- Safetensors to GGUF with vision tensor renaming
- Fused MoE gate_up_proj splitting
- Vision patch embedding reshape (HF to Conv2D layout)
- Packed clamp data tensor for ClippableLinear bounds
- Proportional RoPE freq_factors generation
Also includes:
- BackendGet() on ml.Tensor for reading weight tensor data
- Q6_K CUDA get_rows kernel support
- MoE-aware ffn_down quantization layer counting
- Gemma4 parser with tool calling and thinking support
- Gemma4 renderer with structured tool format
- Architecture-based auto-detection of renderer/parser/stop tokens
- Integration test gemma4 model list additions
* gemma4: add audio support with USM conformer encoder
Add audio encoding for Gemma 4 using the USM conformer architecture:
- Converter: audio tensor mapping, SSCP/conformer/embedder name replacements,
softplus repacker for per_dim_scale, F32 enforcement for conv weights
- GGML backend: Conv1DDW and PadExt tensor ops
- Audio encoder: SSCP Conv2D, 12 conformer blocks (FFW + block-local
attention with relative position embeddings + LightConv1d + FFW),
output projection, audio-to-text embedding projector
- Audio preprocessing: WAV decode, mel spectrogram, FFT (pure Go)
- Model wiring: WAV detection, audio token handling, unified PostTokenize
Correctly transcribes "why is the sky blue" from test audio.
* integration: add gemma4 audio tests including OpenAI API coverage
Test audio transcription and response via the Ollama native API, plus
two new tests exercising the OpenAI-compatible endpoints:
- /v1/audio/transcriptions (multipart form upload)
- /v1/chat/completions with input_audio content type
All tests use capability checks and skip models without audio support.
* gemma4: add OpenAI audio API support and capability detection
- Add CapabilityAudio and detect from audio.block_count in GGUF
- Add /v1/audio/transcriptions endpoint with TranscriptionMiddleware
- Add input_audio content type support in /v1/chat/completions
- Add TranscriptionRequest/Response types in openai package
* gemma4: add audio input support for run command
- /audio toggle in interactive mode for voice chat
- Platform-specific microphone recording (AVFoundation on macOS,
PulseAudio/ALSA on Linux, WASAPI on Windows)
- Space to start/stop recording, automatic chunking for long audio
* gemma4: add transcribe command (ollama transcribe MODEL)
- Interactive mode with readline prompt and slash commands
- Non-interactive mode for piped audio or record-until-Ctrl+C
- Chunked streaming transcription for long recordings
- Word-wrapped output matching run command style
* gemma4: add parser, renderer, and integration test plumbing
* gemma4: fix renderer to emit BOS token
* gemma4: add OpenAI audio transcription API and input_audio support
* gemma4: update converter for new weight drop naming
* gemma4: add per_expert_scale to MoE router and fix moe_intermediate_size config
* gemma4: rewrite renderer to match HF Jinja2 template exactly
Fix 8 bugs found by building 55 reference tests verified against the
HF Jinja2 chat template (VERIFY_JINJA2=1 shells out to Python):
- Tool responses use separate <|turn>tool turns (not inline tags)
- Tool calls emitted before content in assistant messages
- Thinking content stripped from assistant history (strip_thinking)
- User, tool, and system content trimmed (template does | trim)
- Empty system message still emits system turn (check role, not content)
- Nested object properties rendered recursively with required field
- Array items specification rendered for array-type properties
- OBJECT/ARRAY type-specific rendering comma logic matches template
Also adds Required field to api.ToolProperty for nested object schemas,
replaces old gemma4_test.go with comprehensive gemma4_reference_test.go,
and commits the Jinja2 template as testdata for verification.
* gemma4: fix MoE fused gate_up split and multiline tool-call arg parsing
- Text MoE: split `ffn_gate_up_exps` into contiguous `[gate|up]` halves instead of stride-2 slices.
- Parser: escape control characters in `<|"|>...<|"|>` string literals when converting tool-call args to JSON.
- Fixes warnings like `invalid character '\n' in string literal` for multiline tool arguments.
- Add Gemma4 parser regressions for multiline tool-call args and `gemma4ArgsToJSON`.
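To illustrate the difference between the two split strategies, here is a small sketch on a flat slice. It assumes the fused row is laid out as all gate values followed by all up values; the real code operates on tensors, and these helper names are hypothetical:

```go
package main

import (
	"errors"
	"fmt"
)

// splitContiguous treats the fused row as [gate | up]: the first half is the
// gate projection and the second half is the up projection.
func splitContiguous(fused []float32) (gate, up []float32, err error) {
	if len(fused)%2 != 0 {
		return nil, nil, errors.New("fused row must have even length")
	}
	half := len(fused) / 2
	return fused[:half], fused[half:], nil
}

// splitStride2 is the interleaved interpretation (even indices are gate, odd
// indices are up). It is shown only for contrast with the contiguous layout.
func splitStride2(fused []float32) (gate, up []float32) {
	for i, v := range fused {
		if i%2 == 0 {
			gate = append(gate, v)
		} else {
			up = append(up, v)
		}
	}
	return gate, up
}

func main() {
	fused := []float32{1, 2, 3, 4} // gate = {1, 2}, up = {3, 4} under the contiguous layout
	g, u, _ := splitContiguous(fused)
	fmt.Println("contiguous:", g, u)
	g2, u2 := splitStride2(fused)
	fmt.Println("stride-2:  ", g2, u2)
}
```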
* cmd: simplify audio input to dropped file attachments
* gemma4: use full SWA memory for better cache reuse
* gemma4: initialize clamps after backend load
* convert: align gemma4 audio tensor renames with llama.cpp
* Remove redundant comments in gemma4 vision model
* Format Gemma4 MoE block field alignment
* use 4096 kvcache.NewSWAMemCache
* convert: support new Gemma4 audio_tower tensor naming (#15221)
Co-authored-by: jmorganca <jmorganca@gmail.com>
* fix integration test defaults for audio
* review comments and lint fixes
* remove unused audio/video files
---------
Co-authored-by: jmorganca <jmorganca@gmail.com>
In allocModel(), the first call to reserveWorstCaseGraph(true) had its
error silently discarded — `return nil` was used instead of `return err`.
This meant that if the prompt-sized graph reservation failed (e.g. due
to insufficient memory), the error was swallowed, allocModel reported
success, and the model appeared to load correctly. Subsequent inference
would then fail in unexpected ways because the worst-case graph was
never properly reserved.
Fix: return the actual error so the caller can handle the failure
(retry with reduced parallelism, report OOM, etc.).
Co-Authored-By: Claude (claude-opus-4-6) <noreply@anthropic.com>
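The bug pattern can be reproduced in a few lines. This is a simplified sketch, not the actual allocModel code: reserve is a hypothetical stand-in for reserveWorstCaseGraph, and the fail flag simulates an out-of-memory condition.

```go
package main

import (
	"errors"
	"fmt"
)

var errOOM = errors.New("insufficient memory for graph")

// reserve is a hypothetical stand-in for reserveWorstCaseGraph. The fail flag
// simulates a reservation that cannot be satisfied.
func reserve(prompt bool, fail bool) error {
	if fail {
		return errOOM
	}
	return nil
}

// allocBuggy mirrors the described bug: the error from the first reservation
// is detected but nil is returned, so the caller sees success.
func allocBuggy(fail bool) error {
	if err := reserve(true, fail); err != nil {
		return nil // bug: error swallowed
	}
	return reserve(false, false)
}

// allocFixed propagates the error so the caller can retry or report it.
func allocFixed(fail bool) error {
	if err := reserve(true, fail); err != nil {
		return err
	}
	return reserve(false, false)
}

func main() {
	fmt.Println("buggy:", allocBuggy(true))
	fmt.Println("fixed:", allocFixed(true))
}
```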
This change adds a new MLX based runner which includes:
* Method-based MLX bindings
* Subprocess-based MLX runner (x/mlxrunner)
* KV cache with tree management
* A basic sampler
The GLM4-MoE-Lite model has been ported to use the new bindings.
---------
Co-authored-by: Michael Yang <git@mxy.ng>
When numPredict is set, the user will receive one less token
than the requested limit. In addition, the stats will incorrectly
show the number of tokens returned as the limit. In cases where
numPredict is not set, the number of tokens is reported correctly.
This occurs because numPredict is checked when setting up the next
batch but hitting the limit will terminate the current batch as well.
Instead, it is better to check the limit as we actually predict tokens.
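A sketch of checking the limit at the point where each token is predicted, so that both the returned tokens and the reported count match the limit. The generate function and its parameters are illustrative only; maxAvailable stands in for the model reaching an end-of-sequence token.

```go
package main

import "fmt"

// generate emits tokens until the limit is reached. A numPredict of zero or
// less means no limit, in which case generation stops after maxAvailable
// tokens (a stand-in for hitting end-of-sequence).
func generate(numPredict, maxAvailable int) (tokens []int, reported int) {
	numPredicted := 0
	for i := 0; i < maxAvailable; i++ {
		tokens = append(tokens, i) // the token is produced first...
		numPredicted++
		// ...and the limit is checked against what was actually predicted.
		if numPredict > 0 && numPredicted >= numPredict {
			break
		}
	}
	return tokens, numPredicted
}

func main() {
	tokens, reported := generate(5, 100)
	fmt.Println(len(tokens), reported)
}
```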
If a sequence is replaced in s.seqs while a batch is computing, the old logits can be decoded into the new sequence. This change rechecks the sequence pointer after compute and skips decoding for replaced entries, preventing stale results from being applied.
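The recheck can be sketched as follows, assuming the batch keeps a snapshot of the sequence pointers it was built from. The types and the decodeBatch helper are hypothetical simplifications of the runner's data structures.

```go
package main

import "fmt"

type sequence struct {
	id      int
	decoded []float32
}

// decodeBatch applies logits only to slots whose sequence pointer is still
// the one the batch was built for. Slots that were replaced while the batch
// was computing are skipped, so stale logits never reach the new sequence.
func decodeBatch(seqs []*sequence, snapshot []*sequence, logits []float32) (applied int) {
	for i, seq := range snapshot {
		if seq == nil || seqs[i] != seq {
			continue // sequence was removed or replaced during compute
		}
		seq.decoded = append(seq.decoded, logits[i])
		applied++
	}
	return applied
}

func main() {
	a, b := &sequence{id: 1}, &sequence{id: 2}
	seqs := []*sequence{a, b}
	snapshot := append([]*sequence(nil), seqs...) // taken when the batch is built

	c := &sequence{id: 3}
	seqs[1] = c // slot 1 is replaced while the batch is computing

	fmt.Println(decodeBatch(seqs, snapshot, []float32{0.5, 0.7}), len(c.decoded))
}
```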
Fix typo in three error messages where 'baackend' was written instead
of 'backend' in the /health endpoint handler when initializing the
dummy model load.
* flash attn: add auto mode for llama engine
If the user does not specify fa in the environment, use auto-mode.
* review comments
* ensure kv cache quantized types have FA explicitly enabled
additional review comments
This PR detects embedding models and sets batch_size = context_size so the full input fits in a single batch.
Previously, if batch size was smaller than the input, tokens could be split across batches and cause a SIGTRAP crash.
This change ensures all tokens stay in one batch and prevents crashes.
Fixes: #12938 #13054
Co-authored-by: Jesse Gross <jesse@ollama.com>
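As a rough illustration of the batch sizing rule, the sketch below picks the batch size and counts how many batches an input would need. Both function names are hypothetical and the numbers are made up for the example.

```go
package main

import "fmt"

// effectiveBatchSize returns the context size for embedding models so that
// the whole input fits in one batch, and the requested batch size otherwise.
func effectiveBatchSize(isEmbedding bool, batchSize, numCtx int) int {
	if isEmbedding {
		return numCtx
	}
	return batchSize
}

// batchesNeeded reports how many batches are required for the given number
// of input tokens (ceiling division).
func batchesNeeded(tokens, batchSize int) int {
	if batchSize <= 0 {
		return 0
	}
	return (tokens + batchSize - 1) / batchSize
}

func main() {
	const tokens, batch, ctx = 2000, 512, 8192
	fmt.Println("generation model batches:", batchesNeeded(tokens, effectiveBatchSize(false, batch, ctx)))
	fmt.Println("embedding model batches: ", batchesNeeded(tokens, effectiveBatchSize(true, batch, ctx)))
}
```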
We used to control the way that llama.cpp saw devices using
CUDA_VISIBLE_DEVICES or similar. This would ensure that the layers
offloaded to a device were actually the ones intended. This is
particularly important because we might reorder devices based on
free memory or performance.
When we started explicitly scheduling layers, this logic went
away but the llamarunner didn't have any way to set the correct
order of devices. This meant that the correct number of layers
would be assigned to a device but not necessarily the layers
that were expected. This change sets up the devices correctly
based on the offload information.
Adds logprobs support to Ollama's API including support for Ollama's
OpenAI-compatible API. By specifying the new 'logprobs' boolean parameter
in the API, Ollama will return the log probabilities for each token generated.
'top_logprobs', an integer value of up to 20, can also be specified.
When specified, the API will also return that number of the most likely
tokens at each token position.
Co-authored-by: Baptiste Jamin <baptiste@crisp.chat>
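For readers unfamiliar with log probabilities, the following sketch shows one common way to derive them from raw logits (a numerically stable log-softmax) and to pick the top entries. It is not the implementation from this change; the type and function names are hypothetical, and only the cap of 20 comes from the description above.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

type tokenLogprob struct {
	Token   int
	Logprob float64
}

// logSoftmax converts raw logits to log probabilities. Subtracting the
// maximum logit before exponentiating avoids overflow.
func logSoftmax(logits []float64) []float64 {
	maxLogit := math.Inf(-1)
	for _, v := range logits {
		if v > maxLogit {
			maxLogit = v
		}
	}
	var sum float64
	for _, v := range logits {
		sum += math.Exp(v - maxLogit)
	}
	logSum := maxLogit + math.Log(sum)
	out := make([]float64, len(logits))
	for i, v := range logits {
		out[i] = v - logSum
	}
	return out
}

// topLogprobs returns the k most likely tokens by log probability, with k
// capped at 20 and at the vocabulary size.
func topLogprobs(logits []float64, k int) []tokenLogprob {
	if k > 20 {
		k = 20
	}
	if k < 0 {
		k = 0
	}
	lp := logSoftmax(logits)
	all := make([]tokenLogprob, len(lp))
	for i, v := range lp {
		all[i] = tokenLogprob{Token: i, Logprob: v}
	}
	sort.Slice(all, func(a, b int) bool { return all[a].Logprob > all[b].Logprob })
	if k > len(all) {
		k = len(all)
	}
	return all[:k]
}

func main() {
	for _, t := range topLogprobs([]float64{1, 3, 2}, 2) {
		fmt.Printf("token %d: %.4f\n", t.Token, t.Logprob)
	}
}
```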
When a model is partially offloaded to system RAM, we can either
do the calculations on the CPU or we can temporarily transfer the
data to the GPU to do the calculations there. Small batches tend
to be better on the CPU, large batches on the GPU.
The llamarunner used the GPU in most cases and the ollamarunner
used the CPU. Although the ollamarunner saw an improvement in
token generation performance, there was a large performance hit
in prompt processing (3-10x).
There is an existing heuristic to dynamically switch between these
two modes but in practice it doesn't have enough information to
accurately make that decision. This adds authoritative data to make
the check work to get the best of both worlds.
Fixes #12037
We currently allocate the worst case batch for max sized
batches, which corresponds to prompt processing. However,
there are some cases where the generated graph is different
for small and large batches. To ensure that we don't need
to allocate memory later after layout has taken place, we
should run the worst case batch both ways and take the larger
amount of memory.
This does not noticeably affect loading speed as the most expensive
part of this logic is from image processing and that does not
occur during token generation.
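The "run it both ways and take the larger" idea can be sketched like this. The reserveFunc type is a hypothetical stand-in for the real graph reservation call, and the byte counts in the example are invented.

```go
package main

import (
	"errors"
	"fmt"
)

// reserveFunc is a hypothetical stand-in for a graph reservation call: it
// returns the bytes needed to run a batch of the given size.
type reserveFunc func(batchSize int) (uint64, error)

// worstCaseMemory reserves for both the prompt-sized batch and a
// single-token batch and returns the larger requirement, so no further
// allocation is needed once layout has been decided.
func worstCaseMemory(reserve reserveFunc, maxBatch int) (uint64, error) {
	large, err := reserve(maxBatch)
	if err != nil {
		return 0, err
	}
	small, err := reserve(1)
	if err != nil {
		return 0, err
	}
	if small > large {
		return small, nil
	}
	return large, nil
}

func main() {
	// Toy cost model in which the small-batch graph needs more memory.
	reserve := func(b int) (uint64, error) {
		if b <= 0 {
			return 0, errors.New("invalid batch size")
		}
		if b == 1 {
			return 900, nil
		}
		return 512, nil
	}
	need, err := worstCaseMemory(reserve, 512)
	fmt.Println(need, err)
}
```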
Currently, checking the length of prompts for embeddings to ensure
they fit in the context window (and possible truncation) occurs in
two places - the Ollama server and runner. This can lead to
inconsistencies in both the checks and reported number of tokens
processed. Since we have to do this processing in the runner, this
consolidates all of the logic there.
Currently, we only record the time for the last batch when processing
the prompt. This results in unrealistically high numbers for the
old llama runner.
Before:
total duration: 31.273112939s
load duration: 4.97054657s
prompt eval count: 32768 token(s)
prompt eval duration: 235.137439ms
prompt eval rate: 139356.80 tokens/s
eval count: 1873 token(s)
eval duration: 18.173182374s
eval rate: 103.06 tokens/s
After:
total duration: 30.024798033s
load duration: 4.758588663s
prompt eval count: 32768 token(s)
prompt eval duration: 7.779621548s
prompt eval rate: 4212.03 tokens/s
eval count: 1769 token(s)
eval duration: 17.148014223s
eval rate: 103.16 tokens/s
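The effect can be illustrated with a small accumulator that sums the time of every prompt batch rather than keeping only the last one. The promptStats type is hypothetical; the batch sizes and timings in main are invented for the example.

```go
package main

import (
	"fmt"
	"time"
)

// promptStats accumulates token counts and processing time across all
// prompt batches.
type promptStats struct {
	tokens   int
	duration time.Duration
}

// addBatch records one processed batch. The duration is added to the running
// total instead of overwriting it.
func (s *promptStats) addBatch(tokens int, d time.Duration) {
	s.tokens += tokens
	s.duration += d
}

// rate returns prompt tokens per second over the whole prompt.
func (s promptStats) rate() float64 {
	if s.duration <= 0 {
		return 0
	}
	return float64(s.tokens) / s.duration.Seconds()
}

func main() {
	var s promptStats
	for i := 0; i < 4; i++ {
		s.addBatch(512, 250*time.Millisecond)
	}
	fmt.Printf("prompt eval count: %d token(s)\n", s.tokens)
	fmt.Printf("prompt eval duration: %v\n", s.duration)
	fmt.Printf("prompt eval rate: %.2f tokens/s\n", s.rate())
}
```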
* feat: Bump llama.cpp to df1b612
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(mtmd): Correctly encode text chunks during mtmd tokenization
There can be text chunks that appear interspersed with the image embeddings
that contain template delimiter tokens for some models. These need to be
correctly translated to text tokens.
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* tests: Use MtmdChunk in image_test
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: Fix unnecessary conversion linting
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(ggml): Revert changes to ggml_hip.cpp
These changes were done largely by our code assistant and are likely wrong
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Revert changes in mem_nvml.cpp
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Update sync point to 1deee0
This brings in several more optimization commits and model support for
EmbeddingGemma
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Update patches for 1deee0
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: sync for bump to 1deee0
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Bad patch updates with errant `+`
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Bump llama.cpp/ggml to 7049736
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: format-patches after latest bump
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
hardErrCh will deadlock since forwardBatch is blocked on
computeStartedCh, which never gets sent. since the response to
hardErrCh is to panic, just panic instead
this change updates how metrics are collected. until now, performance
metrics, specifically initial input processing and subsequent generation
durations, were collected by taking the timestamp when creating a new
sequence, the first token generation, and completing generation. the
processing duration is taken as first token generation minus sequence
creation, while generation is taken as completing generation minus first
token generation.
while this approach is an accurate end-to-end metric of processing and
generation, it's not comparable to other tools which only measure the
active, i.e. decode, duration.
this change updates the metrics to only capture decode duration so it
can be more directly compared to other tools
This revamps how we discover GPUs in the system by leveraging the Ollama
runner. This should eliminate inconsistency between our GPU discovery and the
runner's capabilities at runtime, particularly for cases where we try to filter
out unsupported GPUs. Now the runner does that implicitly based on the actual
device list. In some cases free VRAM reporting can be unreliable which can
lead to scheduling mistakes, so this also includes a patch to leverage more
reliable VRAM reporting libraries if available.
Automatic workarounds have been removed as only one GPU leveraged this, which
is now documented. This GPU will soon fall off the support matrix with the next
ROCm bump.
Additional cleanup of the scheduler and discovery packages can be done in the
future once we have switched on the new memory management code, and removed
support for the llama runner.
The context must always be able to store the current batch, so
if the user requests a small context then we should also shrink
the batch to match. This also fixes the TestLongInputContext
test on the new engine. (The old engine already has this behavior.)
* perf: build graph for next batch in parallel to keep GPU busy
This refactors the main run loop of the ollama runner to perform the main
GPU-intensive tasks (Compute+Floats) in a goroutine so we can prepare the next
batch in parallel, reducing the amount of time the GPU stalls waiting for the
next batch of work.
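A simplified sketch of this overlap, assuming batch preparation and compute can be modeled as plain functions. runPipelined and its callbacks are hypothetical; the real run loop deals with graphs, caches, and error handling that are omitted here.

```go
package main

import "fmt"

// runPipelined computes batch i in a goroutine while the calling goroutine
// prepares batch i+1, then waits for the compute result before moving on.
func runPipelined(n int, prepare func(i int) int, compute func(batch int) int) []int {
	results := make([]int, 0, n)
	if n <= 0 {
		return results
	}
	next := prepare(0)
	for i := 0; i < n; i++ {
		done := make(chan int, 1)
		// The batch is passed by value so the goroutine never reads `next`
		// after it has been reassigned below.
		go func(b int) { done <- compute(b) }(next)
		if i+1 < n {
			next = prepare(i + 1) // overlaps with the compute above
		}
		results = append(results, <-done)
	}
	return results
}

func main() {
	prepare := func(i int) int { return i * 10 }
	compute := func(b int) int { return b + 1 }
	fmt.Println(runPipelined(3, prepare, compute))
}
```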
* tests: tune integration tests for ollama engine
This tunes the integration tests to focus more on models supported
by the new engine.
This changes the memory allocation strategy from upfront estimation to
tracking actual allocations done by the engine and reacting to that. The
goal is to avoid issues caused by both under-estimation (crashing) and
over-estimation (low performance due to under-utilized GPUs).
It is currently opt-in and can be enabled for models running on the
Ollama engine by setting OLLAMA_NEW_ESTIMATES=1. Behavior in other
cases is unchanged and will continue to use the existing estimates.