ollama/x/imagegen/docs/blob-format.md

# Tensor Blob Format

Ollama stores model tensors as individual blobs in the safetensors format. Each blob contains a logical tensor (or a combined quantized tensor with its scale/bias components), or a group of logical tensors (e.g. shared experts for a given layer along with the scale/bias components for that tensor).

## Safetensors File Format

Every blob follows the [safetensors](https://github.com/huggingface/safetensors) layout:

```
[8 bytes: header_size (uint64 LE)] [header_size bytes: JSON header] [tensor data region]
```

The JSON header maps tensor names to their dtype, shape, and byte offsets within the data region. A special `__metadata__` key holds string-to-string metadata.

## Unquantized Blobs

An unquantized blob stores a single tensor keyed by its name:

```json
{
  "model.layers.0.self_attn.q_proj.weight": {
    "dtype": "BF16",
    "shape": [2560, 2560],
    "data_offsets": [0, 13107200]
  }
}
```

The tensor key is the full tensor name. Dtype is typically `BF16` or `F32`.

## Quantized Blobs (Combined Format)

A quantized blob stores the packed weight, scaling factors, and optional zero-point biases in a single file. Tensor keys use the tensor name, with `.scale` and `.bias` suffixes for the auxiliary tensors:

```json
{
  "__metadata__": {
    "quant_type": "int4",
    "group_size": "32"
  },
  "model.layers.0.mlp.up_proj.weight": {
    "dtype": "U32",
    "shape": [2560, 320],
    "data_offsets": [0, 3276800]
  },
  "model.layers.0.mlp.up_proj.weight.scale": {
    "dtype": "BF16",
    "shape": [2560, 80],
    "data_offsets": [3276800, 3686400]
  },
  "model.layers.0.mlp.up_proj.weight.bias": {
    "dtype": "BF16",
    "shape": [2560, 80],
    "data_offsets": [3686400, 4096000]
  }
}
```

### Metadata Fields

| Field | Description |
|---|---|
| `quant_type` | Quantization type: `int4`, `int8`, `nvfp4`, or `mxfp8` |
| `group_size` | Number of elements per quantization group (e.g., `32`, `64`) |

### Tensor Keys

| Key | Description |
|---|---|
| `{name}` | Packed quantized weights (dtype `U32`) |
| `{name}.scale` | Per-group scaling factors |
| `{name}.bias` | Per-group zero-point offsets (affine modes only) |

## Quantization Types

| Type | Bits | Group Size | Mode | Has Bias |
|---|---|---|---|---|
| `int4` | 4 | 32 | affine | yes |
| `int8` | 8 | 64 | affine | yes |
| `nvfp4` | 4 | 16 | nvfp4 | no |
| `mxfp8` | 8 | 32 | mxfp8 | no |

**Affine modes** (`int4`, `int8`) use `scale + bias` for dequantization. The bias tensor provides the zero-point offset.

**Non-affine modes** (`nvfp4`, `mxfp8`) use only `scale` with specialized E4M3 scale formats.

### Packed Weight Shape

Quantized weights are packed into `uint32` values:
- **4-bit** (int4, nvfp4): 8 values per uint32, so `packed_cols = original_cols / 8`
- **8-bit** (int8, mxfp8): 4 values per uint32, so `packed_cols = original_cols / 4`

Scale shape: `[rows, original_cols / group_size]`

## Manifest References

Blobs are referenced from the model manifest as layers:

```json
{
  "mediaType": "application/vnd.ollama.image.tensor",
  "digest": "sha256:abc123...",
  "size": 4096150,
  "name": "model.layers.0.mlp.up_proj.weight"
}
```

Each tensor (quantized or not) is one layer in the manifest. The layer name matches the tensor key in the blob header.

## Packed Blobs (Expert Groups)

For MoE (Mixture of Experts) models, expert tensors from the same layer are packed into a single blob to reduce blob count and improve loading efficiency. A packed blob is a standard safetensors file containing multiple tensor entries:

```json
{
  "model.layers.1.mlp.experts.0.down_proj.weight": {
    "dtype": "U32",
    "shape": [2560, 640],
    "data_offsets": [0, 6553600]
  },
  "model.layers.1.mlp.experts.0.down_proj.weight.scale": {
    "dtype": "BF16",
    "shape": [2560, 40],
    "data_offsets": [6553600, 6963200]
  },
  "model.layers.1.mlp.experts.0.gate_proj.weight": {
    "dtype": "U32",
    "shape": [10240, 320],
    "data_offsets": [6963200, 20070400]
  },
  "model.layers.1.mlp.experts.0.gate_proj.weight.scale": { "..." : "..." }
}
```

### Grouping Rules

- `model.layers.{L}.mlp.experts.*` tensors are packed into one blob per layer
- `model.layers.{L}.mlp.shared_experts.*` tensors are packed into one blob per layer
- All other tensors remain as individual blobs

### Manifest Representation

One manifest layer per packed group, using the group prefix as the layer name:

```json
{
  "mediaType": "application/vnd.ollama.image.tensor",
  "digest": "sha256:...",
  "size": 123456789,
  "name": "model.layers.1.mlp.experts"
}
```

## Loading

At load time, `mlx_load_safetensors` opens each blob via mmap for zero-copy access. For combined quantized blobs, the loader extracts `{name}`, `{name}.scale`, and `{name}.bias` tensors and caches them as `name`, `name + "_scale"`, and `name + "_qbias"` respectively, maintaining compatibility with the weight loading interface.

For packed blobs, if the manifest layer name (group prefix) is not found as a tensor key, the loader parses the blob header to discover all tensor names and loads each individually.