# Tokenizer
Tokenizer for LLM inference supporting the BPE, SentencePiece, and WordPiece algorithms. The goal of this package is to see whether a pure Go tokenizer can be both fast and correct. It primarily supports the imagegen models, though it (or parts of it) could be considered as a replacement for Ollama's tokenizer in the model package.
## Features

- BPE (Byte Pair Encoding) - GPT-2/Llama style with byte-level encoding
- SentencePiece - Gemma style with `▁` space handling
- WordPiece - BERT style with `##` continuation tokens
- Parallel encoding - Automatic parallelization for inputs >4KB (see the sketch after this list)
- HuggingFace compatible - Loads `tokenizer.json` directly
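The parallel path can be illustrated with a minimal sketch, not the package's actual implementation: split large input at a whitespace boundary (so no pretoken spans the cut) and encode the halves concurrently. `encodeChunk` is a hypothetical stand-in for the sequential encoder, and `[]int32` token IDs are an assumption.

```go
package tokenizer

import "sync"

// encodeParallel is a hedged sketch of the >4KB parallel path.
func encodeParallel(text string, encodeChunk func(string) []int32) []int32 {
	const threshold = 4 << 10 // below ~4KB, goroutine overhead dominates
	if len(text) < threshold {
		return encodeChunk(text)
	}

	// Back up from the midpoint to a space so the right half keeps its
	// leading space (GPT-2-style pretokens own their leading space).
	mid := len(text) / 2
	for mid > 0 && text[mid] != ' ' {
		mid--
	}
	if mid == 0 {
		return encodeChunk(text) // no safe split point found
	}

	var left []int32
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		left = encodeChunk(text[:mid])
	}()
	right := encodeChunk(text[mid:])
	wg.Wait()
	return append(left, right...)
}
```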
## Usage

```go
package main

import (
	"fmt"
	"log"

	"github.com/ollama/ollama/x/imagegen/tokenizer"
)

func main() {
	// Load from a HuggingFace model directory
	tok, err := tokenizer.Load("./weights/Llama-3.2-1B")
	if err != nil {
		log.Fatal(err)
	}

	// Encode text to token IDs
	ids := tok.Encode("Hello, world!", false) // false = don't add BOS

	// Decode back to text
	text := tok.Decode(ids)
	fmt.Println(text)

	// Check special tokens
	if tok.IsEOS(ids[len(ids)-1]) {
		// End of sequence
	}
}
```
## Performance
Benchmarks on Apple M3 Max:
| Input Size | Encode | Decode | Tokens |
|---|---|---|---|
| 1 KB | 14.5 MB/s | 267 MB/s | 231 |
| 10 KB | 10.9 MB/s | 321 MB/s | 2,301 |
| 100 KB | 8.9 MB/s | 311 MB/s | 23,001 |
| 1 MB | 9.6 MB/s | 321 MB/s | 230,001 |
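Numbers like these come from ordinary Go benchmarks. A minimal sketch of such a measurement, assuming the `Load`/`Encode` API shown above (the weights path is a placeholder, and the actual benchmarks live in `tokenizer_test.go`):

```go
package tokenizer_test

import (
	"strings"
	"testing"

	"github.com/ollama/ollama/x/imagegen/tokenizer"
)

func BenchmarkEncode(b *testing.B) {
	tok, err := tokenizer.Load("./weights/Llama-3.2-1B") // placeholder path
	if err != nil {
		b.Fatal(err)
	}
	input := strings.Repeat("The quick brown fox jumps over the lazy dog. ", 23) // ~1KB
	b.SetBytes(int64(len(input))) // makes `go test -bench` report MB/s
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		tok.Encode(input, false)
	}
}
```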
Comparison with other implementations (10 MB input):
| Implementation | Encode Speed | Notes |
|---|---|---|
| Engine (this) | ~10 MB/s | stdlib RE2, parallel >4KB |
| tiktoken (Rust) | ~17 MB/s | Highly optimized regex |
| Ollama (Go) | ~2-3 MB/s | regexp2 backtracking |
## Performance Opportunities
Potential optimizations not yet implemented:
| Optimization | Expected Gain | Complexity |
|---|---|---|
| Aho-Corasick for special tokens | 2-3x for many special tokens | Medium |
| Custom regex engine (like tiktoken) | 1.5-2x | High |
| SIMD byte scanning | 1.3-1.5x for pretokenizer | Medium |
| Assembly BPE merge loop | 1.2-1.5x | High |
| Memoization for repeated substrings | Variable | Low |
Current bottleneck is the pretokenizer regex (~60% of encode time). tiktoken achieves ~17 MB/s with a hand-tuned Rust regex engine.
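Of the options above, memoization is the cheapest to try: natural text repeats the same pretokens constantly, so caching merge results skips the BPE loop entirely on a hit. A minimal sketch, assuming a hypothetical `bpeMerge` function standing in for the per-pretoken merge loop:

```go
// encoder memoizes BPE merge results per pretoken.
type encoder struct {
	cache map[string][]int32 // pretoken -> token IDs
}

// encodePretoken returns cached token IDs when the pretoken has been
// seen before; otherwise it runs the (hypothetical) merge loop once
// and stores the result.
func (e *encoder) encodePretoken(piece string, bpeMerge func(string) []int32) []int32 {
	if ids, ok := e.cache[piece]; ok {
		return ids
	}
	ids := bpeMerge(piece)
	e.cache[piece] = ids
	return ids
}
```

A production version would need an eviction policy to bound memory, and either per-goroutine caches or a `sync.Map` to stay safe on the parallel encoding path.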
## Not Yet Implemented
| Feature | Used By | Notes |
|---|---|---|
| Unigram tokenizer | T5, ALBERT, mBART | Different algorithm (not BPE) |
| Unicode normalizers | Some multilingual models | NFD, NFKC, lowercase, etc. |
| Custom pretokenizers | Model-specific | Beyond standard patterns |
Most HuggingFace models use BPE or SentencePiece, which are fully supported. WordPiece (BERT-style) is also supported with standard [UNK] fallback for out-of-vocabulary characters.
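For reference, WordPiece follows the standard greedy longest-match-first algorithm: take the longest vocabulary prefix of a word, match the remainder against `##`-prefixed continuation entries, and emit `[UNK]` for the whole word when no piece matches. A simplified, byte-oriented sketch; the package's actual vocabulary layout and ID types may differ:

```go
// wordpieceEncode greedily matches the longest vocabulary prefix, then
// continues with ##-prefixed subwords; the whole word maps to [UNK]
// if any piece fails to match.
func wordpieceEncode(word string, vocab map[string]int32, unkID int32) []int32 {
	var ids []int32
	start := 0
	for start < len(word) {
		end := len(word)
		var id int32
		found := false
		for end > start {
			sub := word[start:end]
			if start > 0 {
				sub = "##" + sub // continuation piece inside a word
			}
			if v, ok := vocab[sub]; ok {
				id, found = v, true
				break
			}
			end-- // shrink the candidate and retry
		}
		if !found {
			return []int32{unkID} // whole word becomes [UNK]
		}
		ids = append(ids, id)
		start = end
	}
	return ids
}
```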
## Files

| File | Description |
|---|---|
| `tokenizer.go` | Main implementation (~1000 lines) |
| `tokenizer_test.go` | Tests and benchmarks |
| `testdata/` | Mini tokenizer for unit tests |