# Tensor Blob Format

Ollama stores model tensors as individual blobs in the safetensors format. Each blob contains either a single logical tensor (for a quantized tensor, the packed weight combined with its scale/bias components) or a group of logical tensors (e.g. the shared experts for a given layer, along with their scale/bias components).

## Safetensors File Format

Every blob follows the [safetensors](https://github.com/huggingface/safetensors) layout:

```
[8 bytes: header_size (uint64 LE)] [header_size bytes: JSON header] [tensor data region]
```

The JSON header maps tensor names to their dtype, shape, and byte offsets within the data region. A special `__metadata__` key holds string-to-string metadata.

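The layout above is enough to locate a tensor's bytes without a safetensors library. The following is a minimal sketch, not Ollama's loader: `read_safetensors_header` and `build_blob` are hypothetical helper names, and the demo tensor `t` is made up for illustration.

```python
import json
import struct


def read_safetensors_header(blob: bytes) -> tuple[dict, int]:
    """Return (header, data_start) for a safetensors blob held in memory."""
    # First 8 bytes: little-endian uint64 giving the JSON header length.
    (header_size,) = struct.unpack_from("<Q", blob, 0)
    data_start = 8 + header_size
    if data_start > len(blob):
        raise ValueError("header_size exceeds blob length")
    header = json.loads(blob[8:data_start].decode("utf-8"))
    return header, data_start


def build_blob(header: dict, data: bytes) -> bytes:
    """Demo helper (hypothetical): serialize a header and data region."""
    raw = json.dumps(header, separators=(",", ":")).encode("utf-8")
    return struct.pack("<Q", len(raw)) + raw + data


# Demo with a made-up one-element F32 tensor.
demo = build_blob(
    {"t": {"dtype": "F32", "shape": [1], "data_offsets": [0, 4]}},
    b"\x00\x00\x80\x3f",
)
header, data_start = read_safetensors_header(demo)

# data_offsets are relative to the start of the data region.
begin, end = header["t"]["data_offsets"]
tensor_bytes = demo[data_start + begin : data_start + end]
```

For real blobs, a loader would typically read only the first `8 + header_size` bytes and memory-map the rest instead of holding the whole file in memory.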
## Unquantized Blobs

An unquantized blob stores a single tensor keyed by its name:

```json
{
  "model.layers.0.self_attn.q_proj.weight": {
    "dtype": "BF16",
    "shape": [2560, 2560],
    "data_offsets": [0, 13107200]
  }
}
```

The tensor key is the full tensor name. Dtype is typically `BF16` or `F32`.

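As a quick consistency check, the length of the `data_offsets` range should equal the number of elements times the element size. A small sketch, assuming 2 bytes per `BF16` element and 4 bytes per `F32` or `U32` element; the helper names are illustrative only:

```python
from math import prod

# Bytes per element for the dtypes that appear in this document.
DTYPE_SIZES = {"BF16": 2, "F32": 4, "U32": 4}


def expected_nbytes(entry: dict) -> int:
    """Byte length implied by an entry's shape and dtype."""
    return prod(entry["shape"]) * DTYPE_SIZES[entry["dtype"]]


def offsets_match_shape(entry: dict) -> bool:
    """True when the data_offsets range has exactly the implied length."""
    begin, end = entry["data_offsets"]
    return end - begin == expected_nbytes(entry)


entry = {"dtype": "BF16", "shape": [2560, 2560], "data_offsets": [0, 13107200]}
print(expected_nbytes(entry))      # 13107200
print(offsets_match_shape(entry))  # True
```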
## Quantized Blobs (Combined Format)

A quantized blob stores the packed weight, scaling factors, and optional zero-point biases in a single file. Tensor keys use the tensor name, with `.scale` and `.bias` suffixes for the auxiliary tensors:

```json
{
  "__metadata__": {
    "quant_type": "int4",
    "group_size": "32"
  },
  "model.layers.0.mlp.up_proj.weight": {
    "dtype": "U32",
    "shape": [2560, 320],
    "data_offsets": [0, 3276800]
  },
  "model.layers.0.mlp.up_proj.weight.scale": {
    "dtype": "BF16",
    "shape": [2560, 80],
    "data_offsets": [3276800, 3686400]
  },
  "model.layers.0.mlp.up_proj.weight.bias": {
    "dtype": "BF16",
    "shape": [2560, 80],
    "data_offsets": [3686400, 4096000]
  }
}
```

### Metadata Fields

| Field | Description |
|---|---|
| `quant_type` | Quantization type: `int4`, `int8`, `nvfp4`, or `mxfp8` |
| `group_size` | Number of elements per quantization group (e.g., `32`, `64`) |

### Tensor Keys

| Key | Description |
|---|---|
| `{name}` | Packed quantized weights (dtype `U32`) |
| `{name}.scale` | Per-group scaling factors |
| `{name}.bias` | Per-group zero-point offsets (affine modes only) |

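The key convention can be applied to a parsed header to collect each weight with its auxiliary tensors. This is a sketch under one stated assumption: a `.scale` or `.bias` key is treated as auxiliary only when its base name is also a key in the same header, so that an ordinary tensor whose name happens to end in `.bias` is not misclassified. `group_quantized_keys` is a hypothetical helper, not part of Ollama.

```python
def group_quantized_keys(header: dict) -> dict:
    """Map each base tensor name to its weight/scale/bias header entries."""
    names = {k for k in header if k != "__metadata__"}
    groups: dict = {}
    for name in sorted(names):
        base, _, suffix = name.rpartition(".")
        if suffix in ("scale", "bias") and base in names:
            # Auxiliary tensor that belongs to the packed weight `base`.
            groups.setdefault(base, {})[suffix] = header[name]
        else:
            groups.setdefault(name, {})["weight"] = header[name]
    return groups


example_header = {
    "__metadata__": {"quant_type": "int4", "group_size": "32"},
    "model.layers.0.mlp.up_proj.weight": {"dtype": "U32", "shape": [2560, 320]},
    "model.layers.0.mlp.up_proj.weight.scale": {"dtype": "BF16", "shape": [2560, 80]},
    "model.layers.0.mlp.up_proj.weight.bias": {"dtype": "BF16", "shape": [2560, 80]},
}
groups = group_quantized_keys(example_header)
```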
## Quantization Types

| Type | Bits | Group Size | Mode | Has Bias |
|---|---|---|---|---|
| `int4` | 4 | 32 | affine | yes |
| `int8` | 8 | 64 | affine | yes |
| `nvfp4` | 4 | 16 | nvfp4 | no |
| `mxfp8` | 8 | 32 | mxfp8 | no |

**Affine modes** (`int4`, `int8`) use `scale + bias` for dequantization. The bias tensor provides the zero-point offset.

**Non-affine modes** (`nvfp4`, `mxfp8`) use only `scale` with specialized E4M3 scale formats.

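To make the affine case concrete, here is a sketch of dequantizing one group. It assumes the common affine convention `w = q * scale + bias`, where `q` is the unsigned integer code; this document does not spell out the exact arithmetic used by the runtime, so treat the formula and the function name as assumptions.

```python
def dequantize_affine_group(q_values, scale, bias):
    """Dequantize one group of integer codes with a shared scale and bias.

    Assumed convention (not confirmed by this document): w = q * scale + bias.
    """
    return [q * scale + bias for q in q_values]


# One group of four 4-bit codes sharing a single scale and bias.
codes = [0, 2, 4, 6]
weights = dequantize_affine_group(codes, scale=0.25, bias=-1.0)
print(weights)  # [-1.0, -0.5, 0.0, 0.5]
```

Under this convention the bias is the value that code `0` maps to, which is why it acts as the zero-point offset for the group.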
### Packed Weight Shape

Quantized weights are packed into `uint32` values:

- **4-bit** (int4, nvfp4): 8 values per uint32, so `packed_cols = original_cols / 8`
- **8-bit** (int8, mxfp8): 4 values per uint32, so `packed_cols = original_cols / 4`

Scale shape: `[rows, original_cols / group_size]`

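The shape rules can be checked against the `int4` example earlier in this document. The sketch below encodes the bit widths and default group sizes from the Quantization Types table; `QUANT_TYPES` and `packed_shapes` are illustrative names, and the divisibility check is an added assumption.

```python
# Bit width and default group size per quantization type (from the table above).
QUANT_TYPES = {
    "int4": {"bits": 4, "group_size": 32},
    "int8": {"bits": 8, "group_size": 64},
    "nvfp4": {"bits": 4, "group_size": 16},
    "mxfp8": {"bits": 8, "group_size": 32},
}


def packed_shapes(rows: int, cols: int, quant_type: str, group_size=None) -> dict:
    """Return the packed weight shape and scale shape for a [rows, cols] tensor."""
    spec = QUANT_TYPES[quant_type]
    group_size = group_size or spec["group_size"]
    values_per_word = 32 // spec["bits"]  # 8 for 4-bit, 4 for 8-bit
    if cols % values_per_word or cols % group_size:
        raise ValueError("cols must be divisible by the packing factor and group size")
    return {
        "weight": [rows, cols // values_per_word],
        "scale": [rows, cols // group_size],
    }


print(packed_shapes(2560, 2560, "int4"))
# {'weight': [2560, 320], 'scale': [2560, 80]}
```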
## Manifest References

Blobs are referenced from the model manifest as layers:

```json
{
  "mediaType": "application/vnd.ollama.image.tensor",
  "digest": "sha256:abc123...",
  "size": 4096150,
  "name": "model.layers.0.mlp.up_proj.weight"
}
```

Each tensor (quantized or not) is one layer in the manifest. The layer name matches the tensor key in the blob header.

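A manifest layer can be checked against the blob it references. The sketch below assumes, as the `sha256:` prefix suggests, that `digest` is the SHA-256 of the blob's bytes and that `size` is the blob length in bytes; both function names are hypothetical.

```python
import hashlib


def digest_of(blob: bytes) -> str:
    """Digest string in the `sha256:<hex>` form used by manifest layers."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


def layer_matches_blob(layer: dict, blob: bytes) -> bool:
    """True when the layer's size and digest both describe `blob`."""
    return layer["size"] == len(blob) and layer["digest"] == digest_of(blob)


# Demo with placeholder bytes standing in for a real tensor blob.
blob = b"placeholder blob bytes"
layer = {
    "mediaType": "application/vnd.ollama.image.tensor",
    "digest": digest_of(blob),
    "size": len(blob),
    "name": "model.layers.0.mlp.up_proj.weight",
}
ok = layer_matches_blob(layer, blob)
```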
## Packed Blobs (Expert Groups)

For MoE (Mixture of Experts) models, expert tensors from the same layer are packed into a single blob to reduce blob count and improve loading efficiency. A packed blob is a standard safetensors file containing multiple tensor entries:

```json
{
  "model.layers.1.mlp.experts.0.down_proj.weight": {
    "dtype": "U32",
    "shape": [2560, 640],
    "data_offsets": [0, 6553600]
  },
  "model.layers.1.mlp.experts.0.down_proj.weight.scale": {
    "dtype": "BF16",
    "shape": [2560, 40],
    "data_offsets": [6553600, 6963200]
  },
  "model.layers.1.mlp.experts.0.gate_proj.weight": {
    "dtype": "U32",
    "shape": [10240, 320],
    "data_offsets": [6963200, 20070400]
  },
  "model.layers.1.mlp.experts.0.gate_proj.weight.scale": { "..." : "..." }
}
```

### Grouping Rules

- `model.layers.{L}.mlp.experts.*` tensors are packed into one blob per layer
- `model.layers.{L}.mlp.shared_experts.*` tensors are packed into one blob per layer
- All other tensors remain as individual blobs

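The grouping rules above can be sketched as a function from a logical tensor name to the name of the blob that holds it. This is an illustration, not the create command's code: `blob_group` is a hypothetical name, and the pattern assumes `{L}` is a decimal layer index.

```python
import re

# Matches the two packed-group prefixes; {L} is assumed to be a decimal index.
_GROUP_RE = re.compile(r"^(model\.layers\.\d+\.mlp\.(?:experts|shared_experts))\.")


def blob_group(tensor_name: str) -> str:
    """Return the group prefix for expert tensors, else the tensor's own name."""
    match = _GROUP_RE.match(tensor_name)
    return match.group(1) if match else tensor_name


print(blob_group("model.layers.1.mlp.experts.0.down_proj.weight"))
# model.layers.1.mlp.experts
print(blob_group("model.layers.1.mlp.shared_experts.up_proj.weight"))
# model.layers.1.mlp.shared_experts
print(blob_group("model.layers.0.self_attn.q_proj.weight"))
# model.layers.0.self_attn.q_proj.weight
```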
### Manifest Representation

Each packed group is represented by a single manifest layer, with the group prefix as the layer name:

```json
{
  "mediaType": "application/vnd.ollama.image.tensor",
  "digest": "sha256:...",
  "size": 123456789,
  "name": "model.layers.1.mlp.experts"
}
```

## Loading

At load time, `mlx_load_safetensors` opens each blob via mmap for zero-copy access. For combined quantized blobs, the loader extracts the `{name}`, `{name}.scale`, and `{name}.bias` tensors and caches them as `name`, `name + "_scale"`, and `name + "_qbias"` respectively, which maintains compatibility with the weight loading interface.

For packed blobs, if the manifest layer name (the group prefix) is not found as a tensor key, the loader parses the blob header to discover all tensor names and loads each one individually.

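The naming logic described above can be sketched as a function that plans which cache key each blob tensor is stored under. This is not the actual loader: `plan_cache_entries` and `is_auxiliary` are hypothetical names, and applying the `_scale` and `_qbias` renaming to tensors inside packed blobs is an assumption, since the text only states it for combined quantized blobs.

```python
def is_auxiliary(key: str, header: dict) -> bool:
    """True for a `.scale`/`.bias` key whose base tensor is in the same header."""
    base, _, suffix = key.rpartition(".")
    return suffix in ("scale", "bias") and base in header


def plan_cache_entries(header: dict, layer_name: str) -> dict:
    """Map cache key -> blob tensor key for one manifest layer."""
    if layer_name in header:
        bases = [layer_name]
    else:
        # Packed blob: the layer name is a group prefix, so discover every
        # base tensor from the header instead.
        bases = [
            k for k in header
            if k != "__metadata__" and not is_auxiliary(k, header)
        ]
    plan = {}
    for base in bases:
        plan[base] = base
        if base + ".scale" in header:
            plan[base + "_scale"] = base + ".scale"
        if base + ".bias" in header:
            plan[base + "_qbias"] = base + ".bias"
    return plan


packed_header = {
    "model.layers.1.mlp.experts.0.down_proj.weight": {},
    "model.layers.1.mlp.experts.0.down_proj.weight.scale": {},
}
plan = plan_cache_entries(packed_header, "model.layers.1.mlp.experts")
```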