Commit Graph

335 Commits

Author SHA1 Message Date
Daniel Hiltgen
96b202d34b Add support for gemma4 (#15214)
* bench: add prompt calibration, context size flag, and NumCtx reporting

Add --num-ctx flag to set context size, and report NumCtx in model info
header. Calibrate tokens-per-word ratio during warmup using actual
tokenization metrics from the model, replacing the fixed 1.3 heuristic.
This produces more accurate prompt token counts for --prompt-tokens.

Also add fetchContextLength() to query running model context via /api/ps.

* integration: improve vision test robustness and add thinking tests

Add skipIfNoVisionOverride() to skip vision tests when OLLAMA_TEST_MODEL
is set to a non-vision model. Add Think:false to context exhaustion test
to prevent thinking models from using all context before the test can
measure it. Add third test image (ollama homepage) and replace OCR test
with ImageDescription test using it. Relax match strings for broader
model compatibility. Add TestThinkingEnabled and TestThinkingSuppressed
to verify thinking output and channel tag handling.

* gemma4: add Gemma 4 GGML model support

Add full Gemma 4 model family support (E2B, E4B, 26B MoE, 31B Dense)
for the GGML backend including text, vision, converter, parser, and
renderer.

Text model features:
- Sliding window + full attention with per-layer patterns
- KV sharing across layers with donor map
- Per-layer embeddings (PLE) with learned projections
- MoE routing with RMSNorm + learned scale
- Proportional RoPE with freq_factors for global attention
- Final logit softcapping

Vision model features:
- SigLIP vision encoder with 2D RoPE
- ClippableLinear with input/output clamping via packed v.clamp_data
- Adaptive average pooling with nMerge kernel
- Multi-modal projection with unweighted RMSNorm

Converter:
- Safetensors to GGUF with vision tensor renaming
- Fused MoE gate_up_proj splitting
- Vision patch embedding reshape (HF to Conv2D layout)
- Packed clamp data tensor for ClippableLinear bounds
- Proportional RoPE freq_factors generation

Also includes:
- BackendGet() on ml.Tensor for reading weight tensor data
- Q6_K CUDA get_rows kernel support
- MoE-aware ffn_down quantization layer counting
- Gemma4 parser with tool calling and thinking support
- Gemma4 renderer with structured tool format
- Architecture-based auto-detection of renderer/parser/stop tokens
- Integration test gemma4 model list additions

* gemma4: add audio support with USM conformer encoder

Add audio encoding for Gemma 4 using the USM conformer architecture:
- Converter: audio tensor mapping, SSCP/conformer/embedder name replacements,
  softplus repacker for per_dim_scale, F32 enforcement for conv weights
- GGML backend: Conv1DDW and PadExt tensor ops
- Audio encoder: SSCP Conv2D, 12 conformer blocks (FFW + block-local
  attention with relative position embeddings + LightConv1d + FFW),
  output projection, audio-to-text embedding projector
- Audio preprocessing: WAV decode, mel spectrogram, FFT (pure Go)
- Model wiring: WAV detection, audio token handling, unified PostTokenize

Correctly transcribes "why is the sky blue" from test audio.

* integration: add gemma4 audio tests including OpenAI API coverage

Test audio transcription and response via the Ollama native API, plus
two new tests exercising the OpenAI-compatible endpoints:
- /v1/audio/transcriptions (multipart form upload)
- /v1/chat/completions with input_audio content type

All tests use capability checks and skip models without audio support.

* gemma4: add OpenAI audio API support and capability detection

- Add CapabilityAudio and detect from audio.block_count in GGUF
- Add /v1/audio/transcriptions endpoint with TranscriptionMiddleware
- Add input_audio content type support in /v1/chat/completions
- Add TranscriptionRequest/Response types in openai package

* gemma4: add audio input support for run command

- /audio toggle in interactive mode for voice chat
- Platform-specific microphone recording (AVFoundation on macOS,
  PulseAudio/ALSA on Linux, WASAPI on Windows)
- Space to start/stop recording, automatic chunking for long audio

* gemma4: add transcribe command (ollama transcribe MODEL)

- Interactive mode with readline prompt and slash commands
- Non-interactive mode for piped audio or record-until-Ctrl+C
- Chunked streaming transcription for long recordings
- Word-wrapped output matching run command style

* gemma4: add parser, renderer, and integration test plumbing

* gemma4: fix renderer to emit BOS token

* gemma4: add OpenAI audio transcription API and input_audio support

* gemma4: update converter for new weight drop naming

* gemma4: add per_expert_scale to MoE router and fix moe_intermediate_size config

* gemma4: rewrite renderer to match HF Jinja2 template exactly

Fix 8 bugs found by building 55 reference tests verified against the
HF Jinja2 chat template (VERIFY_JINJA2=1 shells out to Python):

- Tool responses use separate <|turn>tool turns (not inline tags)
- Tool calls emitted before content in assistant messages
- Thinking content stripped from assistant history (strip_thinking)
- User, tool, and system content trimmed (template does | trim)
- Empty system message still emits system turn (check role, not content)
- Nested object properties rendered recursively with required field
- Array items specification rendered for array-type properties
- OBJECT/ARRAY type-specific rendering comma logic matches template

Also adds Required field to api.ToolProperty for nested object schemas,
replaces old gemma4_test.go with comprehensive gemma4_reference_test.go,
and commits the Jinja2 template as testdata for verification.

* gemma4: fix MoE fused gate_up split and multiline tool-call arg parsing

- Text MoE: split `ffn_gate_up_exps` into contiguous `[gate|up]` halves instead of stride-2 slices.
- Parser: escape control characters in `<|"|>...<|"|>` string literals when converting tool-call args to JSON.
- Fixes warnings like `invalid character '\n' in string literal` for multiline tool arguments.
- Add Gemma4 parser regressions for multiline tool-call args and `gemma4ArgsToJSON`.

* cmd: simplify audio input to dropped file attachments

* gemma4: use full SWA memory for better cache reuse

* gemma4: initialize clamps after backend load

* convert: align gemma4 audio tensor renames with llama.cpp

* Remove redundant comments in gemma4 vision model

* Format Gemma4 MoE block field alignment

* use 4096 kvcache.NewSWAMemCache

* convert: support new Gemma4 audio_tower tensor naming (#15221)

Co-authored-by: jmorganca <jmorganca@gmail.com>

* fix integration test defaults for audio

* review comments and lint fixes

* remove unused audio/video files

---------

Co-authored-by: jmorganca <jmorganca@gmail.com>
2026-04-02 11:33:33 -07:00
Eva H
7575438366 cmd: ollama launch vscode (#15060)
Co-authored-by: Parth Sareen <parth.sareen@ollama.com>
2026-03-25 16:37:02 -04:00
Parth Sareen
f8b657c967 cmd/launch: add guards for headless mode (#14837) 2026-03-14 00:10:02 -07:00
Devon Rifkin
9e7ba835da cmd: still populate ollama ls when using ollama run <model:cloud> (#14824)
This is temporary until `api/tags` supports cloud natively
2026-03-13 12:24:45 -07:00
Parth Sareen
af5f7c0a9e cmd: refactor tui and launch (#14609) 2026-03-12 18:39:06 -07:00
Patrick Devine
3e06bde643 mlx: get parameters from modelfile during model creation (#14747) 2026-03-09 15:33:24 -07:00
Daniel Hiltgen
4425c54eda create: fix localhost handling (#14681) 2026-03-06 16:35:58 -08:00
Jeffrey Morgan
4eab60c1e2 Reapply "don't require pulling stubs for cloud models" again (#14608)
* Revert "Revert "Reapply "don't require pulling stubs for cloud models"" (#14606)"

This reverts commit 39982a954e.

* fix test + do cloud lookup only when seeing cloud models

---------

Co-authored-by: ParthSareen <parth.sareen@ollama.com>
2026-03-06 14:27:47 -08:00
Daniel Hiltgen
6928630601 mlx: prevent remote creation mismatch (#14651)
If the user is pointing at a remote OLLAMA_HOST, fail experimental safetensor
based create operations as we only support local creation currently.
2026-03-05 14:59:00 -08:00
Jeffrey Morgan
39982a954e Revert "Reapply "don't require pulling stubs for cloud models"" (#14606)
This reverts commit 799e51d419.
2026-03-03 20:56:10 -08:00
Patrick Devine
110eff01a9 chore: remove old imagegen LLMs models (#14597)
These models are implemented in the x/mlxrunner instead.
2026-03-03 13:23:40 -08:00
Jeffrey Morgan
799e51d419 Reapply "don't require pulling stubs for cloud models"
This reverts commit 97d2f05a6d.
2026-03-03 13:17:10 -08:00
Jeffrey Morgan
97d2f05a6d Revert "don't require pulling stubs for cloud models (#14574)" (#14596)
This reverts commit 8207e55ec7.
2026-03-03 12:51:23 -08:00
Devon Rifkin
8207e55ec7 don't require pulling stubs for cloud models (#14574)
* don't require pulling stubs for cloud models

This is a first in a series of PRs that will better integrate Ollama's
cloud into the API and CLI. Previously we used to have a layer of
indirection where you'd first have to pull a "stub" model that contains
a reference to a cloud model. With this change, you don't have to pull
first, you can just use a cloud model in various routes like `/api/chat`
and `/api/show`. This change respects
<https://github.com/ollama/ollama/pull/14221>, so if cloud is disabled,
these models won't be accessible.

There's also a new, simpler pass-through proxy that doesn't convert the
requests ahead of hitting the cloud models, which they themselves
already support various formats (e.g., `v1/chat/completions` or Open
Responses, etc.). This will help prevent issues caused by double
converting (e.g., `v1/chat/completions` converted to `api/chat` on the
client, then calling cloud and converting back to a
`v1/chat/completions` response instead of the cloud model handling the
original `v1/chat/completions` request first).

There's now a notion of "source tags", which can be mixed with existing
tags. So instead of having different formats like`gpt-oss:20b-cloud` vs.
`kimi-k2.5:cloud` (`-cloud` suffix vs. `:cloud`), you can now specify
cloud by simply appending `:cloud`. This PR doesn't change model
resolution yet, but sets us up to allow for things like omitting the
non-source tag, which would make something like `ollama run
gpt-oss:cloud` work the same way that `ollama run gpt-oss` already works
today.

More detailed changes:

- Added a shared model selector parser in `types/modelselector`:
  - supports `:cloud` and `:local`
  - accepts source tags in any position
  - supports legacy `:<tag>-cloud`
  - rejects conflicting source tags
- Integrated selector handling across server inference/show routes:
  - `GenerateHandler`, `ChatHandler`, `EmbedHandler`,
    `EmbeddingsHandler`, `ShowHandler`
- Added explicit-cloud passthrough proxy for ollama.com:
  - same-endpoint forwarding for `/api/*`, `/v1/*`, and `/v1/messages`
  - normalizes `model` (and `name` for `/api/show`) before forwarding
  - forwards request headers except hop-by-hop/proxy-managed headers
  - uses bounded response-header timeout
  - handles auth failures in a friendly way
- Preserved cloud-disable behavior (`OLLAMA_NO_CLOUD`)
- Updated create flow to support `FROM ...:cloud` model sources (though
  this flow uses the legacy proxy still, supporting Modelfile overrides
  is more complicated with the direct proxy approach)
- Updated CLI/TUI/config cloud detection to use shared selector logic
- Updated CLI preflight behavior so explicit cloud requests do not
  auto-pull local stubs

What's next?

- Cloud discovery/listing and cache-backed `ollama ls` / `/api/tags`
- Modelfile overlay support for virtual cloud models on OpenAI/Anthropic
  request families
- Recommender/default-selection behavior for ambiguous model families
- Fully remove the legacy flow

Fixes: https://github.com/ollama/ollama/issues/13801

* consolidate pull logic into confirmAndPull helper

pullIfNeeded and ShowOrPull shared identical confirm-and-pull logic.
Extract confirmAndPull to eliminate the duplication.

* skip local existence checks for cloud models

ModelExists and the TUI's modelExists both check the local model list,
which causes cloud models to appear missing. Return true early for
explicit cloud models so the TUI displays them beside the integration
name and skips re-prompting the model picker on relaunch.

* support optionally pulling stubs for newly-style names

We now normalize names like `<family>:<size>:cloud` into legacy-style
names like `<family>:<size>-cloud` for pulling and deleting (this also
supports stripping `:local`). Support for pulling cloud models is
temporary, once we integrate properly into `/api/tags` we won't need
this anymore.

* Fix server alias syncing

* Update cmd/cmd.go

Co-authored-by: Parth Sareen <parth.sareen@ollama.com>

* address comments

* improve some naming

---------

Co-authored-by: ParthSareen <parth.sareen@ollama.com>
2026-03-03 10:46:33 -08:00
Parth Sareen
3445223311 cmd: openclaw onboarding (#14344) 2026-02-20 19:08:38 -08:00
Patrick Devine
9aefd2dfee model: add qwen3 support to mlxrunner (#14293) 2026-02-17 13:58:49 -08:00
Parth Sareen
2f4de1acf7 cmd: ollama launch always show model picker (#14299) 2026-02-17 12:02:14 -08:00
Patrick Devine
d18dcd7775 mlxrunner fixes (#14247)
* load glm4_moe_lite from the mlxrunner

* fix loading diffusion models

* remove log lines

* fix --imagegen flag
2026-02-13 22:30:42 -08:00
Parth Sareen
f0a07a353b cmd/tui: fix powershell search (#14242) 2026-02-13 15:53:11 -08:00
Devon Rifkin
948de6bbd2 add ability to disable cloud (#14221)
* add ability to disable cloud

Users can now easily opt-out of cloud inference and web search by
setting

```
"disable_ollama_cloud": true
```

in their `~/.ollama/server.json` settings file. After a setting update,
the server must be restarted.

Alternatively, setting the environment variable `OLLAMA_NO_CLOUD=1` will
also disable cloud features. While users previously were able to avoid
cloud models by not pulling or `ollama run`ing them, this gives them an
easy way to enforce that decision. Any attempt to run a cloud model when
cloud is disabled will fail.

The app's old "airplane mode" setting, which did a similar thing for
hiding cloud models within the app is now unified with this new cloud
disabled mode. That setting has been replaced with a "Cloud" toggle,
which behind the scenes edits `server.json` and then restarts the
server.

* gate cloud models across TUI and launch flows when cloud is disabled

Block cloud models from being selected, launched, or written to
integration configs when cloud mode is turned off:

- TUI main menu: open model picker instead of launching with a
  disabled cloud model
- cmd.go: add IsCloudModelDisabled checks for all Selection* paths
- LaunchCmd: filter cloud models from saved Editor configs before
  launch, fall through to picker if none remain
- Editor Run() methods (droid, opencode, openclaw): filter cloud
  models before calling Edit() and persist the cleaned list
- Export SaveIntegration, remove SaveIntegrationModel wrapper that
  was accumulating models instead of replacing them

* rename saveIntegration to SaveIntegration in config.go and tests

* cmd/config: add --model guarding and empty model list fixes

* Update docs/faq.mdx

Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>

* Update internal/cloud/policy.go

Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>

* Update internal/cloud/policy.go

Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>

* Update server/routes.go

Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>

* Revert "Update internal/cloud/policy.go"

This reverts commit 8bff8615f9.

Since this error shows up in other integrations, we want it to be
prefixed with Ollama

* rename cloud status

* more status renaming

* fix tests that weren't updated after rename

---------

Co-authored-by: ParthSareen <parth.sareen@ollama.com>
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
2026-02-12 15:47:00 -08:00
Parth Sareen
77ba9404ac cmd/tui: improve model picker UX (#14209) 2026-02-11 14:36:54 -08:00
Patrick Devine
0aaf6119ec feature: add ctrl-g to allow users to use an editor to edit their prompt (#14197) 2026-02-11 13:04:41 -08:00
Parth Sareen
f08427c138 cmd: TUI UX improvements (#14198) 2026-02-11 10:18:41 -08:00
Jeffrey Morgan
9ec733e527 cmd: make 'ollama login' and 'ollama logout' aliases for 'ollama signin' and 'ollama signout' respectively (#14144) 2026-02-09 19:12:42 -08:00
Patrick Devine
235ba3df5c cmd: ollama menu and launch improvements (#14038) 2026-02-09 11:30:16 -08:00
Parth Sareen
5f53fe7884 cmd: ollama launch improvements (#14099) 2026-02-05 15:08:17 -08:00
Bruce MacDonald
c323161f24 cmd: helpful error message for remote models (#14057)
When trying to use cloud model with OLLAMA_HOST="ollama.com" while not signed in a helpful error message is displayed when the user is not signed in telling them they must sign in to use cloud models. This should be the same experience for models which specify a remote instance.
2026-02-04 14:55:11 -08:00
Bruce MacDonald
a6355329bf cmd: open browser on ollama signin when available (#14055)
When a browser is available open it to the connect URL automatically when running the `ollama signin` command. Browser is not opened in any other unauthorized scenario.
2026-02-03 16:42:09 -08:00
Seokrin Taron Sung
f92e362b2e cmd: capitalize Ollama in serve command help text (#13965) 2026-01-29 09:47:53 -08:00
Parth Sareen
aae6ecbaff cmd: rename ollama config to ollama launch (#13871) 2026-01-23 18:40:40 -08:00
Jeffrey Morgan
66831dcf70 x/imagegen: fix image editing support (#13866)
- Fix panic in ollama show for image gen models (safe type assertion)
- Add vision capability for Flux2KleinPipeline models at create time
- Flatten transparent PNG images onto white background for better results
2026-01-23 15:37:17 -08:00
Parth Sareen
199c41e16e cmd: ollama config command to help configure integrations to use Ollama (#13712) 2026-01-22 20:17:11 -08:00
Jeffrey Morgan
68e00c7c36 fix: prevent image generation models from loading during deletion (#13781)
Move the unload check (empty prompt + KeepAlive=0) before the image
generation model dispatch in GenerateHandler. This prevents models like
flux from being loaded into memory just to be immediately unloaded when
running `ollama rm`.

Also fix a bug in DeleteHandler where `args[0]` was used instead of
`arg` in the delete loop, causing only the first model to be unloaded
when deleting multiple models.
2026-01-19 12:48:34 -08:00
Jeffrey Morgan
634c416645 Add experimental image generation fields to /api/generate (#13753)
Request fields (experimental):
- width: image width (max 4096)
- height: image height (max 4096)
- steps: denoising steps
- seed: random seed

Response fields (experimental):
- images: base64-encoded generated images
- completed: current step progress
- total: total steps

Other changes:
- Fix lifecycle bug where image models wouldn't unload (refCount issue)
- Fix "headers already written" error on Ctrl+C during streaming
- Add gin middleware for OpenAI /v1/images/generations compatibility
- Update CLI to use /api/generate with progress bar
- Add preload support in interactive mode
2026-01-17 18:27:41 -08:00
Patrick Devine
a077d996e3 Fix create and show commands for experimental models (#13741)
* x: make `ollama create --experimental` import from safetensors

This change allows pulling in safetensors models into the new experimental model format, and also
fixes the `ollama show` command to be able to correctly display the model information.

* gofumpt the linter

* gofumpt the linter again

* validate the model name
2026-01-16 14:31:55 -08:00
Parth Sareen
d06acbcb19 x/cmd: enable web search and web fetch with flag (#13690) 2026-01-12 13:59:40 -08:00
Jeffrey Morgan
9667c2282f x/imagegen: add naive TeaCache and FP8 quantization support (#13683)
TeaCache:
- Timestep embedding similarity caching for diffusion models
- Polynomial rescaling with configurable thresholds
- Reduces transformer forward passes by ~30-50%

FP8 quantization:
- Support for FP8 quantized models (8-bit weights with scales)
- QuantizedMatmul on Metal, Dequantize on CUDA
- Client-side quantization via ollama create --quantize fp8

Other bug fixes:
- Fix `/api/show` API for image generation models
- Server properly returns model info (architecture, parameters, quantization)
- Memory allocation optimizations
- CLI improvements for image generation
2026-01-12 13:45:22 -08:00
Jeffrey Morgan
2584940016 Add z-image image generation prototype (#13659) 2026-01-09 21:09:46 -08:00
Parth Sareen
1ef4241727 x: request access for all commands, add welcome message (#13662) 2026-01-09 18:20:39 -08:00
Parth Sareen
12e2b3514a x: agent loop ux improvements (#13635) 2026-01-07 01:27:15 -08:00
Parth Sareen
76912c062a x: add experimental agent loop (#13628) 2026-01-05 23:38:40 -08:00
Jeffrey Morgan
8852220f59 add REQUIRES command to Modelfile (#13361) 2025-12-18 13:21:29 -08:00
Patrick Devine
d3e0a0dee4 model: ministral w/ llama4 scaling (#13292)
This change:

* fixes rope scaling in the mistral converter
* updates ministral to include llama4 scaling
* includes a new ministral parser for parsing reasoning and tool calling

---------

Co-authored-by: jmorganca <jmorganca@gmail.com>
2025-12-01 23:20:14 -08:00
nicole pardal
1ca608bcd1 embeddings: added embedding command for cl (#12795)
Co-authored-by: A-Akhil <akhilrahul70@gmail.com>

This PR introduces a new ollama embed command that allows users to generate embeddings directly from the command line.

Added ollama embed MODEL [TEXT...] command for generating text embeddings
Supports both direct text arguments and stdin piping for scripted workflows

Outputs embeddings as JSON arrays (one per line)
2025-11-05 11:58:03 -08:00
Patrick Devine
f89fc1cadd bugfix: show connection string for interactive cli usage (#12930) 2025-11-05 11:55:04 -08:00
Michael Yang
ed78e127d0 fix(cmd): unload model before removal (#12832)
this change fixes two bugs with `ollama rm`:

1. before a model is removed, it will first be stopped. this only
   happens for the first argument and skipped for all other models
2. models are unloaded indiscriminately. this errors for cloud models
   and should be omitted
2025-10-30 10:41:49 -07:00
Patrick Devine
b04e46da3e bugfix: restore the current runOptions if loading fails in the CLI (#12402)
There are two bugs when using `/load <model>` for a model that doesn't exist, namely:
  1. it will not restore the current model settings if the current model is a thinking model; and
  2. it will crash is the current model is a non-thinking model

This bug fix saves the current runOptions and then restores them if the model load
doesn't happen. It also fixes the crash happening for non-thinking models.
2025-09-25 18:30:45 -07:00
Patrick Devine
5a56ff3cf0 cli: add device signin flow when doing ollama push (#12405) 2025-09-25 15:04:43 -07:00
Patrick Devine
64883e3c4c auth: fix problems with the ollama keypairs (#12373)
* auth: fix problems with the ollama keypairs

This change adds several fixes including:
  - reading in the pubkey files correctly
  - fixing the push unit test to create a keypair file in a temp directory
  - not return 500 errors for normal status error
2025-09-22 23:20:20 -07:00
Patrick Devine
8b894933a7 engine: add remote proxy (#12307) 2025-09-17 14:40:53 -07:00