* tokenizer: add SentencePiece-style BPE support
Add WithSentencePieceNormalizer option to BytePairEncoding for models
that use BPE with SentencePiece-style space markers (space to/from
U+2581).
NewBytePairEncoding is unchanged; the new NewBytePairEncodingWithOptions
constructor accepts BPEOption functions. Decoding handles the reverse
mapping of U+2581 back to spaces.
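A rough sketch of the shape this adds (the normalize/denormalize fields and the exact constructor signature here are simplified assumptions, not the actual implementation):

```go
package tokenizer

import "strings"

// Simplified stand-in for the real BytePairEncoding; field names are illustrative.
type BytePairEncoding struct {
	pretokenizer string
	normalize    func(string) string // applied to input text before encoding
	denormalize  func(string) string // applied to output text after decoding
}

type BPEOption func(*BytePairEncoding)

// WithSentencePieceNormalizer maps spaces to U+2581 before merging and back
// to spaces when decoding.
func WithSentencePieceNormalizer() BPEOption {
	return func(bpe *BytePairEncoding) {
		bpe.normalize = func(s string) string { return strings.ReplaceAll(s, " ", "\u2581") }
		bpe.denormalize = func(s string) string { return strings.ReplaceAll(s, "\u2581", " ") }
	}
}

// NewBytePairEncodingWithOptions applies any number of BPEOption functions;
// plain NewBytePairEncoding stays as it was.
func NewBytePairEncodingWithOptions(pre string, opts ...BPEOption) BytePairEncoding {
	bpe := BytePairEncoding{pretokenizer: pre}
	for _, opt := range opts {
		opt(&bpe)
	}
	return bpe
}
```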
* review comments
Replace hardcoded Encode(prompt, true) with
Encode(prompt, r.Tokenizer.AddBOS()) so the pipeline respects each
model's tokenizer configuration.
Models with add_bos_token=true (gemma3, llama): unchanged; the tokenizer
still prepends BOS.
Models with bos_token=null (qwen3, qwen3.5): unchanged; the BOS
guard (vocab.BOS >= 0) already prevented prepending regardless of
the flag.
This aligns the pipeline with the /v1/tokenize endpoint which already
uses Tokenizer.AddBOS().
pullModelManifest unmarshals the registry response into a Go struct
and then re-marshals it with json.Marshal before writing to disk. When the
registry's JSON formatting or field ordering differs from Go's
output, the local SHA256 won't match the registry's
Ollama-Content-Digest header, causing false "out of date" warnings.
Preserve the raw bytes from the registry response and write them
directly to disk so the local manifest is byte-for-byte identical
to what the registry serves.
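Roughly, the fix amounts to something like this (function shape and Manifest fields are simplified for illustration):

```go
package server

import (
	"encoding/json"
	"io"
	"net/http"
	"os"
)

// Manifest fields elided; only what's needed for the sketch.
type Manifest struct {
	SchemaVersion int `json:"schemaVersion"`
}

func pullManifest(resp *http.Response, path string) (*Manifest, error) {
	raw, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}

	var m Manifest
	if err := json.Unmarshal(raw, &m); err != nil {
		return nil, err
	}

	// Write the registry's bytes verbatim instead of json.Marshal(m), so the
	// on-disk SHA256 matches the Ollama-Content-Digest header exactly.
	if err := os.WriteFile(path, raw, 0o644); err != nil {
		return nil, err
	}
	return &m, nil
}
```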
* anthropic: fix empty inputs in content blocks
When we switched to `api.ToolCallFunctionArguments`, `omitempty` stopped
suppressing the field the way we were relying on. As a result, non-tool
content blocks ended up with an `"input": {}` field, which doesn't match our
old behavior.
* use omitzero instead
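Illustrative version of the tag change (the real content-block struct and api.ToolCallFunctionArguments differ):

```go
package anthropic

// Stand-in for api.ToolCallFunctionArguments, assumed here to be a
// struct-like type rather than a plain map.
type toolArguments struct {
	raw map[string]any
}

type contentBlock struct {
	Type string `json:"type"`
	Text string `json:"text,omitempty"`

	// omitempty only drops empty strings, zero numbers, nil pointers, and
	// empty slices/maps; it never drops a struct value, so text blocks were
	// serialized with an empty `"input": {}`. omitzero (Go 1.24+) drops the
	// field whenever Input is its zero value.
	Input toolArguments `json:"input,omitzero"`
}
```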
The staleness check compared the local manifest digest (SHA256 of the
file on disk) against the registry's Ollama-Content-Digest header.
These never matched because PullModel re-serializes the manifest JSON
before writing, producing different bytes than the registry's original.
The fallback comparison (local modified_at vs upstream push time) was
also broken: the generated TypeScript Time class discards the actual
timestamp value, so Date parsing always produced NaN.
Fix by moving the staleness comparison server-side where we have
reliable access to both the local manifest file mtime and the upstream
push time. The /api/v1/model/upstream endpoint now returns a simple
`stale` boolean instead of raw digests for the frontend to compare.
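The server-side check boils down to something like this (helper name and wiring are assumptions; only the `stale` boolean response comes from the change itself):

```go
package server

import (
	"os"
	"time"
)

// isStale reports whether the upstream manifest was pushed after the local
// manifest file was last written. /api/v1/model/upstream returns this as
// {"stale": true|false} for the frontend.
func isStale(manifestPath string, upstreamPushedAt time.Time) (bool, error) {
	fi, err := os.Stat(manifestPath)
	if err != nil {
		return false, err
	}
	return upstreamPushedAt.After(fi.ModTime()), nil
}
```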
Also adds User-Agent to the CORS allowed headers for dev mode.
A stop-gap for now to guide users better. We'll add more in-depth recommendations per integration as well.
---------
Co-authored-by: Parth Sareen <parth.sareen@ollama.com>
Add periodic snapshots every 8k tokens and near the end of the prompt
so that long prompts can be partially restored and thinking/generation
can be retried without full reprocessing.
Update the LRU last-used time only on the nodes that were actually used
during processing, rather than on all snapshots along the path. This
lets eviction remove nodes more accurately, so we can avoid additional
heuristics such as auto-merging nodes.
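Conceptually (node layout and names here are hypothetical):

```go
package cache

import "time"

type trieNode struct {
	lastUsed time.Time
	children map[int32]*trieNode
}

// touch bumps the LRU clock only for the nodes whose snapshots were actually
// restored or extended during this request, not every snapshot along the
// path, so eviction can rank untouched intermediate nodes honestly.
func touch(used []*trieNode, now time.Time) {
	for _, n := range used {
		n.lastUsed = now
	}
}
```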
mlx.Copy shares the backing buffer with its source (via
copy_shared_buffer) rather than allocating independent storage.
When used to snapshot a slice of the KV cache, the snapshot array
holds the entire original cache buffer alive through the shared
data pointer — even after eval detaches the computation graph.
Replace Copy with Contiguous in Snapshot and Split. Contiguous
allocates a compact buffer when the source buffer is significantly
larger than the logical slice (Contiguous::eval checks
buffer_size > nbytes + 16384), which is always the case for KV
cache slices.
Copilot Chat prefers to use `general.basename` in the built-in Ollama
integration, but this name isn't usually shown directly to users (and
there may be many models that share this name). Instead we pass back
`req.Model`, which for this extension is the value that we return from
`/api/tags`.
* integration: improve ability to test individual models
Add OLLAMA_TEST_MODEL env var to run integration tests against a
single model.
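One plausible way for a test helper to consume the env var (the default model list below is just an example):

```go
package integration

import (
	"os"
	"testing"
)

// testModels returns the models a test should run against. Setting
// OLLAMA_TEST_MODEL narrows the run to a single model.
func testModels(t *testing.T) []string {
	t.Helper()
	if m := os.Getenv("OLLAMA_TEST_MODEL"); m != "" {
		return []string{m}
	}
	return []string{"llama3.2", "qwen3"} // default set is illustrative
}
```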
Enhance vision tests: multi-turn chat with cached image tokens, object
counting, spatial reasoning, detail recognition, scene understanding, OCR, and
multi-image comparison.
Add tool calling stress tests with complex agent-style prompts, large
system messages, and multi-turn tool response handling.
* review comments
Previously, a partial match within a node's edge would truncate the path
to the parent snapshot, effectively making all cache types behave as
recurrent caches. Caches with only transformer layers can rewind to an
arbitrary boundary, so this restores that capability to improve cache
hits.
* mlx: update to HEAD on 3/23
Also fixes a few miscellaneous vendoring bugs uncovered by this first
update, and renames the version files to make them clearer.
* CUDA Fast Gated Delta kernel
* mlx: detect eval errors and panic
On model errors or missing kernels, don't mask the error; bubble it up.
Receiving from a buffered chan error consumes the value, so only the
first caller (WaitUntilRunning, HasExited, or Close) sees the signal.
Subsequent receivers block or take the wrong branch. Replace with a
closed chan struct{} which can be received from any number of times,
and store the error in a separate field.
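The pattern in miniature (field names are illustrative):

```go
package runner

import "sync"

// process illustrates the pattern: a closed channel broadcasts "exited" to
// any number of receivers, and the error lives in its own field.
type process struct {
	done chan struct{} // closed exactly once when the subprocess exits
	mu   sync.Mutex
	err  error
}

func (p *process) markExited(err error) {
	p.mu.Lock()
	p.err = err
	p.mu.Unlock()
	close(p.done) // every current and future receiver observes this
}

func (p *process) wait() error {
	<-p.done // unlike receiving from a buffered chan error, this never consumes the signal
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.err
}
```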
The stderr reader used bufio.Scanner which has a 64KB max line size.
If the subprocess wrote a line exceeding this limit, the scanner would
stop reading, the OS pipe buffer would fill, and the subprocess would
deadlock.
Replace the scanner with a statusWriter that wraps io.Copy. The writer
forwards all stderr to os.Stderr while capturing the last short line
(≤256 bytes) for error reporting, avoiding both the deadlock and the
need to buffer arbitrarily long lines.
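A minimal version of that writer (handling of lines split across Write calls is simplified):

```go
package runner

import (
	"bytes"
	"os"
)

// statusWriter forwards subprocess stderr to our own stderr while keeping
// the last short line seen, for use in error messages.
type statusWriter struct {
	lastLine []byte
}

func (w *statusWriter) Write(p []byte) (int, error) {
	for _, line := range bytes.Split(p, []byte("\n")) {
		// Only keep short lines; long dumps aren't useful in an error message,
		// and we never have to buffer an arbitrarily long line.
		if n := len(bytes.TrimSpace(line)); n > 0 && n <= 256 {
			w.lastLine = append(w.lastLine[:0], line...)
		}
	}
	return os.Stderr.Write(p)
}
```

Passing this to io.Copy drains the pipe continuously, so the subprocess can't block on a full pipe buffer.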
If `OLLAMA_DEBUG_LOG_REQUESTS` is set, then on server startup a temp
folder will be created. Upon any inference request, the body will be
logged to a file in this folder, as well as a small shell script to
"replay" the request using cURL.
This is just intended for debugging scenarios, not as something to turn
on normally.
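Roughly what gets written per request (file naming, port, and script contents here are assumptions):

```go
package server

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
	"time"
)

// logRequest saves the request body plus a small cURL replay script into the
// debug folder, then hands the body back to the real handler.
func logRequest(dir string, r *http.Request) error {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		return err
	}
	r.Body = io.NopCloser(bytes.NewReader(body))

	stamp := time.Now().Format("20060102-150405.000000000")
	if err := os.WriteFile(filepath.Join(dir, stamp+".json"), body, 0o600); err != nil {
		return err
	}

	script := fmt.Sprintf("#!/bin/sh\ncurl http://localhost:11434%s \\\n  -H 'Content-Type: application/json' \\\n  -d @%s.json\n",
		r.URL.Path, stamp)
	return os.WriteFile(filepath.Join(dir, stamp+".sh"), []byte(script), 0o700)
}
```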
The previous XML repair for glm was a good start, but we need to go further and repair any incorrect opening or closing tags.
Co-authored-by: Dongluo Chen <dongluo.chen@gmail.com>
Enable multiple conversations to reuse cached computations when they
share token prefixes (e.g. the same system prompt). A prefix trie
tracks shared regions so switching between conversations only
recomputes tokens that diverge. Inactive conversation state is paged
from active GPU memory to other memory and restored on demand, with LRU
eviction to keep memory usage bounded.
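A rough picture of the bookkeeping (all names hypothetical; the real structures differ):

```go
package cache

import "time"

// Where a node's KV state currently lives.
type location int

const (
	onGPU    location = iota // active GPU memory
	pagedOut                 // paged to other memory, restored on demand
)

// prefixNode covers a run of tokens shared by one or more conversations.
type prefixNode struct {
	tokens   []int32
	children map[int32]*prefixNode // next token -> child
	where    location
	lastUsed time.Time // drives LRU eviction to keep memory bounded
}
```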
Slice used cmp.Or to resolve a zero stop value to the dimension size,
intended to support open-ended slices like a[i:]. This made Slice(0, 0)
indistinguishable from Slice(), so any slice with a zero stop would
silently include the entire dimension instead of being empty.
Replace cmp.Or with an explicit End sentinel and resolve negative
indices against the dimension size, matching Python/PyTorch semantics.
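Sketch of the sentinel approach (names illustrative):

```go
package mlx

import "math"

// End marks an open-ended slice, e.g. a[i:], without overloading zero.
const End = math.MinInt

// resolveStop turns a Slice stop argument into a concrete index.
func resolveStop(stop, dim int) int {
	switch {
	case stop == End:
		return dim // slice to the end of the dimension
	case stop < 0:
		return dim + stop // negative indices count from the end, as in Python/PyTorch
	default:
		return stop // Slice(0, 0) is now genuinely empty
	}
}
```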