
Image Generation in Ollama (Experimental)

Generate images from text prompts using local AI models.

Quick Start

# Run with a prompt
ollama run z-image "a sunset over mountains"
Generating: step 30/30
Image saved to: /tmp/ollama-image-1704067200.png

On macOS, the generated image will automatically open in Preview.

Supported Models

Model     VRAM Required   Notes
z-image   ~12GB           Based on the Flux architecture

CLI Usage

# Generate an image
ollama run z-image "a cat playing piano"

# Check if model is running
ollama ps

# Stop the model
ollama stop z-image

API

OpenAI-Compatible Endpoint

POST /v1/images/generations

Request:

{
  "model": "z-image",
  "prompt": "a sunset over mountains",
  "size": "1024x1024",
  "response_format": "b64_json"
}

Response:

{
  "created": 1704067200,
  "data": [
    {
      "b64_json": "iVBORw0KGgo..."
    }
  ]
}

Example: cURL

curl http://localhost:11434/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "z-image",
    "prompt": "a white cat",
    "size": "1024x1024"
  }'

Example: Save to File

curl -s http://localhost:11434/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "z-image",
    "prompt": "a white cat",
    "size": "1024x1024"
  }' | jq -r '.data[0].b64_json' | base64 -d > image.png
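Example: Python

The same round trip can be scripted with only the Python standard library. This is a sketch against the endpoint documented above; the helper names (build_request, save_first_image, generate) are illustrative, not part of Ollama:

```python
import base64
import json
import urllib.request

def build_request(prompt, model="z-image", size="1024x1024"):
    # Body for POST /v1/images/generations (see Parameters below)
    return {"model": model, "prompt": prompt, "size": size,
            "response_format": "b64_json"}

def save_first_image(response, path):
    # Decode the first b64_json entry and write the PNG bytes to disk
    png = base64.b64decode(response["data"][0]["b64_json"])
    with open(path, "wb") as f:
        f.write(png)
    return len(png)

def generate(prompt, host="http://localhost:11434"):
    # Requires a running Ollama server with the z-image model available
    req = urllib.request.Request(
        host + "/v1/images/generations",
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# save_first_image(generate("a white cat"), "image.png")
```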

Streaming Progress

Enable streaming to receive progress updates via SSE:

curl http://localhost:11434/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"model": "z-image", "prompt": "a sunset", "stream": true}'

Events:

event: progress
data: {"step": 1, "total": 30}

event: progress
data: {"step": 2, "total": 30}
...

event: done
data: {"created": 1704067200, "data": [{"b64_json": "..."}]}
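A client consumes this stream by splitting on blank lines and pairing each event name with its data payload. A minimal parser sketch (standard-library Python; parse_sse is our own name, not an Ollama API):

```python
import json

def parse_sse(stream_text):
    # Collect (event, data) pairs from an SSE stream like the one above.
    events, name, data_lines = [], None, []
    for line in stream_text.splitlines():
        if line.startswith("event:"):
            name = line[6:].strip()
        elif line.startswith("data:"):
            data_lines.append(line[5:].strip())
        elif line == "" and name is not None:
            # Blank line terminates an event
            events.append((name, json.loads("\n".join(data_lines))))
            name, data_lines = None, []
    if name is not None and data_lines:        # flush a trailing event
        events.append((name, json.loads("\n".join(data_lines))))
    return events
```

A real client would read the HTTP response incrementally rather than buffering the whole stream, but the framing is the same.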

Parameters

Parameter         Type     Default       Description
model             string   required      Model name
prompt            string   required      Text description of the image
size              string   "1024x1024"   Image dimensions (WxH)
n                 int      1             Number of images (currently only 1 supported)
response_format   string   "b64_json"    "b64_json" or "url"
stream            bool     false         Enable progress streaming

Requirements

  • macOS with Apple Silicon (M1/M2/M3/M4)
  • CUDA: tested on CUDA 12 Blackwell, more testing coming soon
  • Sufficient VRAM (see model table above)
  • Ollama built with MLX support

Limitations

  • Primarily macOS (MLX backend); CUDA support is still early (see Requirements)
  • Single image per request
  • Fixed step count (30 steps)
  • Modelfiles are not yet supported (run ollama create from a model directory)

Tensor Model Storage Format

Tensor models store each tensor as a separate blob with metadata in the manifest. This enables faster downloads (parallel fetching) and deduplication (shared tensors are stored once).

Manifest Structure

The manifest follows the standard ollama format with tensor-specific layer metadata:

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
  "config": { "digest": "sha256:...", "size": 1234 },
  "layers": [
    {
      "mediaType": "application/vnd.ollama.image.tensor",
      "digest": "sha256:25b36eed...",
      "size": 49807448,
      "name": "text_encoder/model.layers.0.mlp.down_proj.weight",
      "dtype": "BF16",
      "shape": [2560, 9728]
    },
    {
      "mediaType": "application/vnd.ollama.image.json",
      "digest": "sha256:abc123...",
      "size": 512,
      "name": "text_encoder/config.json"
    }
  ]
}

Each tensor layer includes:

  • name: Path-style tensor name (e.g., text_encoder/model.layers.0.mlp.down_proj.weight)
  • dtype: Data type (BF16, F32, etc.)
  • shape: Tensor dimensions

Config layers use the same path-style naming (e.g., tokenizer/tokenizer.json).
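A consumer distinguishes tensor layers from config layers by media type. A sketch of iterating a parsed manifest dict (the helper name is our own):

```python
TENSOR_MEDIA_TYPE = "application/vnd.ollama.image.tensor"

def tensor_layers(manifest):
    # Yield (name, dtype, shape, digest) for each tensor layer in a manifest dict
    for layer in manifest.get("layers", []):
        if layer.get("mediaType") == TENSOR_MEDIA_TYPE:
            yield layer["name"], layer["dtype"], layer["shape"], layer["digest"]
```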

Blob Format

Each tensor blob is a minimal safetensors file:

[8 bytes: header size (uint64 LE)]
[~80 bytes: JSON header, padded to 8-byte alignment]
[N bytes: raw tensor data]

Header contains a single tensor named "data":

{"data":{"dtype":"BF16","shape":[2560,9728],"data_offsets":[0,49807360]}}
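Producing this layout is mechanical: serialize the one-tensor header, pad it to 8-byte alignment, and prefix the little-endian length. A Python sketch of the wrapping step (wrap_tensor is our own name):

```python
import json
import struct

def wrap_tensor(raw, dtype, shape):
    # [8 bytes: header size (uint64 LE)][JSON header, 8-byte aligned][raw data]
    header = {"data": {"dtype": dtype, "shape": shape,
                       "data_offsets": [0, len(raw)]}}
    hj = json.dumps(header, separators=(",", ":")).encode()
    hj += b" " * ((-len(hj)) % 8)          # pad with spaces to 8-byte alignment
    return struct.pack("<Q", len(hj)) + hj + raw
```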

Why Include the Header?

The ~88 byte safetensors header enables MLX's native mlx_load_safetensors function, which:

  1. Uses mmap - Maps file directly into memory, no copies
  2. Zero-copy to GPU - MLX reads directly from mapped pages
  3. No custom code - Standard MLX API, battle-tested

Without the header, we'd need custom C++ code to create MLX arrays from raw mmap'd data. MLX's public API doesn't expose this: it always copies when creating arrays from external pointers.

The overhead is negligible: 88 bytes per tensor = ~100KB total for a 13GB model (0.0007%).

Why Per-Tensor Blobs?

Deduplication: Blobs are content-addressed by SHA256. If two models share identical tensors (same weights, dtype, shape), they share the same blob file.

Example: Model A and Model B both use the same text encoder. The text encoder's 400 tensors are stored once, referenced by both manifests.

~/.ollama/models/
  blobs/
    sha256-25b36eed...  <- shared by both models
    sha256-abc123...
  manifests/
    library/model-a/latest  <- references sha256-25b36eed
    library/model-b/latest  <- references sha256-25b36eed
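The dedup mechanism is just the naming scheme: a blob's file name is the SHA256 of its bytes, so writing the same tensor twice is a no-op. A sketch of such a store (write_blob is our own name, not Ollama's code):

```python
import hashlib
import os

def write_blob(root, data):
    # Content address: the file name is the SHA256 of the bytes, so identical
    # tensors from different models collapse into a single blob file.
    name = "sha256-" + hashlib.sha256(data).hexdigest()
    path = os.path.join(root, "blobs", name)
    if not os.path.exists(path):           # already stored -> deduplicated
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)
    return path
```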

Import Flow

cd ./weights/Z-Image-Turbo
ollama create z-image

1. Scan component directories (text_encoder/, transformer/, vae/)
2. For each .safetensors file:
   - Extract individual tensors
   - Wrap each in minimal safetensors format (88B header + data)
   - Write to blob store (SHA256 content-addressed)
   - Add layer entry to manifest with path-style name
3. Copy config files (*.json) as config layers
4. Write manifest
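Step 2 above can be sketched in Python: read a multi-tensor safetensors file, slice out each tensor's bytes, and re-wrap each one as a minimal single-tensor blob (split_safetensors is an illustrative name, not the actual importer):

```python
import json
import struct

def split_safetensors(buf):
    # Parse a multi-tensor safetensors file (bytes) and re-wrap each tensor
    # as its own minimal single-tensor blob.
    hlen = struct.unpack("<Q", buf[:8])[0]
    header = json.loads(buf[8:8 + hlen])
    data = buf[8 + hlen:]
    blobs = {}
    for name, meta in header.items():
        if name == "__metadata__":         # optional safetensors metadata entry
            continue
        start, end = meta["data_offsets"]
        raw = data[start:end]
        h = json.dumps({"data": {"dtype": meta["dtype"], "shape": meta["shape"],
                                 "data_offsets": [0, len(raw)]}},
                       separators=(",", ":")).encode()
        h += b" " * ((-len(h)) % 8)        # pad to 8-byte alignment
        blobs[name] = struct.pack("<Q", len(h)) + h + raw
    return blobs
```

Each returned blob would then be handed to the content-addressed store and recorded as a manifest layer under its path-style name.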

FP8 Quantization

Z-Image supports FP8 quantization to reduce memory usage by ~50% while maintaining image quality.

Usage

cd ./weights/Z-Image-Turbo
ollama create z-image-fp8 --quantize fp8

This quantizes weights during import. The resulting model will be ~15GB instead of ~31GB.
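The ~50% figure follows from bytes per parameter: BF16 stores 2 bytes per weight, FP8 stores 1. A back-of-the-envelope check (the ~15.5B parameter count here is inferred from the ~31GB BF16 size above, not an official figure):

```python
def weight_bytes(num_params, bytes_per_param):
    # Rough on-disk weight size: parameter count x bytes per parameter
    return num_params * bytes_per_param

params = 15_500_000_000                    # assumed: implied by ~31GB of BF16 weights
bf16_gb = weight_bytes(params, 2) / 1e9    # BF16: 2 bytes/param -> ~31 GB
fp8_gb = weight_bytes(params, 1) / 1e9     # FP8: 1 byte/param -> ~15.5 GB
```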