bench: improve benchmarking tool (#14240)

New features:
- Warmup phase to eliminate cold-start outliers
- time-to-first-token measured in each epoch
- VRAM/memory tracking to identify CPU spillover
- Controlled prompt length
- Defaults to 6 epochs and 200 tokens max

Benchstat fixes:
- ns/request instead of ns/op: non-standard unit created a separate group instead of grouping with timing metrics
- Token count as the N field: benchstat interprets N as iteration count for statistical weighting, not as a token count
2026-04-18 16:54:13 +02:00 · 2026-03-15 11:47:31 -07:00
parent f8b657c967
commit 79c1e93c00
3 changed files with 1471 additions and 309 deletions
--- a/cmd/bench/README.md
+++ b/cmd/bench/README.md
@@ -1,27 +1,31 @@
 Ollama Benchmark Tool
 ---------------------

-A Go-based command-line tool for benchmarking Ollama models with configurable parameters and multiple output formats.
+A Go-based command-line tool for benchmarking Ollama models with configurable parameters, warmup phases, TTFT tracking, VRAM monitoring, and benchstat/CSV output.

 ## Features

 * Benchmark multiple models in a single run
 * Support for both text and image prompts
 * Configurable generation parameters (temperature, max tokens, seed, etc.)
- * Supports benchstat and CSV output formats
- * Detailed performance metrics (prefill, generate, load, total durations)
+ * Warmup phase before timed epochs to stabilize measurements
+ * Time-to-first-token (TTFT) tracking per epoch
+ * Model metadata display (parameter size, quantization level, family)
+ * VRAM and CPU memory usage tracking via running process info
+ * Controlled prompt token length for reproducible benchmarks
+ * Benchstat and CSV output formats

 ## Building from Source

 ```
-go build -o ollama-bench bench.go
-./ollama-bench -model gpt-oss:20b -epochs 6 -format csv
+go build -o ollama-bench ./cmd/bench
+./ollama-bench -model gemma3 -epochs 6 -format csv
 ```

 Using Go Run (without building)

 ```
-go run bench.go -model gpt-oss:20b -epochs 3
+go run ./cmd/bench -model gemma3 -epochs 3
 ```

 ## Usage
@@ -45,10 +49,16 @@ benchstat -col /name gemma.bench
 ./ollama-bench -model qwen3-vl -image photo.jpg -epochs 6 -max-tokens 100 -p "Describe this image"
 ```

+### Controlled Prompt Length
+
+```
+./ollama-bench -model gemma3 -epochs 6 -prompt-tokens 512
+```
+
 ### Advanced Example

 ```
-./ollama-bench -model llama3 -epochs 10 -temperature 0.7 -max-tokens 500 -seed 42 -format csv -output results.csv
+./ollama-bench -model llama3 -epochs 10 -temperature 0.7 -max-tokens 500 -seed 42 -warmup 2 -format csv -output results.csv
 ```

 ## Command Line Options
@@ -56,41 +66,48 @@ benchstat -col /name gemma.bench
 | Option  	| Description | Default |
 |----------|-------------|---------|
 | -model	| Comma-separated list of models to benchmark	| (required)		|
-| -epochs	| Number of iterations per model		| 1			|
-| -max-tokens	| Maximum tokens for model response		| 0 (unlimited)		|
+| -epochs	| Number of iterations per model		| 6			|
+| -max-tokens	| Maximum tokens for model response		| 200			|
 | -temperature	| Temperature parameter				| 0.0			|
 | -seed		| Random seed					| 0 (random)		|
 | -timeout	| Timeout in seconds				| 300			|
-| -p		| Prompt text					| "Write a long story."	|
+| -p		| Prompt text					| (default story prompt)	|
 | -image	| Image file to include in prompt		| 			|
 | -k		| Keep-alive duration in seconds		| 0			|
 | -format	| Output format (benchstat, csv)		| benchstat		|
 | -output	| Output file for results			| "" (stdout)		|
+| -warmup	| Number of warmup requests before timing	| 1			|
+| -prompt-tokens	| Generate prompt targeting ~N tokens (0 = use -p)	| 0		|
 | -v		| Verbose mode					| false			|
 | -debug	| Show debug information			| false			|

 ## Output Formats

-### Markdown Format
+### Benchstat Format (default)

-The default markdown format is suitable for copying and pasting into a GitHub issue and will look like:
-```
- Model | Step | Count | Duration | nsPerToken | tokensPerSec |
-|-------|------|-------|----------|------------|--------------|
-| gpt-oss:20b | prefill | 124 | 30.006458ms | 241987.56 | 4132.44 |
-| gpt-oss:20b | generate | 200 | 2.646843954s | 13234219.77 | 75.56 |
-| gpt-oss:20b | load | 1 | 121.674208ms | - | - |
-| gpt-oss:20b | total | 1 | 2.861047625s | - | - |
-```
-
-### Benchstat Format
-
-Compatible with Go's benchstat tool for statistical analysis:
+Compatible with Go's benchstat tool for statistical analysis. Uses one value/unit pair per line, standard `ns/op` for timing metrics, and `ns/token` for throughput. Each epoch produces one set of lines -- benchstat aggregates across repeated runs to compute statistics.

 ```
-BenchmarkModel/name=gpt-oss:20b/step=prefill 128 78125.00 ns/token 12800.00 token/sec
-BenchmarkModel/name=gpt-oss:20b/step=generate 512 19531.25 ns/token 51200.00 token/sec
-BenchmarkModel/name=gpt-oss:20b/step=load 1 1500000000 ns/request
+# Model: gemma3 | Params: 4.3B | Quant: Q4_K_M | Family: gemma3 | Size: 4080218931 | VRAM: 4080218931
+BenchmarkModel/name=gemma3/step=prefill 1 78125.00 ns/token 12800.00 token/sec
+BenchmarkModel/name=gemma3/step=generate 1 19531.25 ns/token 51200.00 token/sec
+BenchmarkModel/name=gemma3/step=ttft 1 45123000 ns/op
+BenchmarkModel/name=gemma3/step=load 1 1500000000 ns/op
+BenchmarkModel/name=gemma3/step=total 1 2861047625 ns/op
+```
+
+Use with benchstat:
+```
+./ollama-bench -model gemma3 -epochs 6 > gemma3.bench
+benchstat -col /step gemma3.bench
+```
+
+Compare two runs:
+```
+./ollama-bench -model gemma3 -epochs 6 > before.bench
+# ... make changes ...
+./ollama-bench -model gemma3 -epochs 6 > after.bench
+benchstat before.bench after.bench
 ```

 ### CSV Format
@@ -99,17 +116,28 @@ Machine-readable comma-separated values:

 ```
 NAME,STEP,COUNT,NS_PER_COUNT,TOKEN_PER_SEC
-gpt-oss:20b,prefill,128,78125.00,12800.00
-gpt-oss:20b,generate,512,19531.25,51200.00
-gpt-oss:20b,load,1,1500000000,0
+# Model: gemma3 | Params: 4.3B | Quant: Q4_K_M | Family: gemma3 | Size: 4080218931 | VRAM: 4080218931
+gemma3,prefill,128,78125.00,12800.00
+gemma3,generate,512,19531.25,51200.00
+gemma3,ttft,1,45123000,0
+gemma3,load,1,1500000000,0
+gemma3,total,1,2861047625,0
 ```

 ## Metrics Explained

-The tool reports four types of metrics for each model:
+The tool reports the following metrics for each epoch:

- * prefill: Time spent processing the prompt
- * generate: Time spent generating the response
- * load: Model loading time (one-time cost)
- * total: Total request duration
+ * **prefill**: Time spent processing the prompt (ns/token)
+ * **generate**: Time spent generating the response (ns/token)
+ * **ttft**: Time to first token -- latency from request start to first response content
+ * **load**: Model loading time (one-time cost)
+ * **total**: Total request duration

+Additionally, the model info comment line (displayed once per model before epochs) includes:
+
+ * **Params**: Model parameter count (e.g., 4.3B)
+ * **Quant**: Quantization level (e.g., Q4_K_M)
+ * **Family**: Model family (e.g., gemma3)
+ * **Size**: Total model memory in bytes
+ * **VRAM**: GPU memory used by the loaded model (when Size > VRAM, the difference is CPU spill)
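
The arithmetic behind two of these fields can be sketched in Go. Both helpers are hypothetical and only illustrate how to read the numbers: `tokensPerSec` assumes the `token/sec` column is the reciprocal of `ns/token`, which the sample values are consistent with, and `cpuSpill` applies the Size/VRAM rule from the list above.

```go
package main

import "fmt"

// tokensPerSec converts a per-token latency in nanoseconds into
// throughput. Hypothetical helper; assumes token/sec = 1e9 / (ns/token).
func tokensPerSec(nsPerToken float64) float64 {
	return 1e9 / nsPerToken
}

// cpuSpill returns how many bytes of the loaded model sit outside GPU
// memory, given the Size and VRAM byte counts from the model info line.
// Hypothetical helper; it returns 0 when the model fits entirely in VRAM.
func cpuSpill(size, vram int64) int64 {
	if size > vram {
		return size - vram
	}
	return 0
}

func main() {
	// Sample prefill value from the README: 78125 ns/token.
	fmt.Println(tokensPerSec(78125)) // 12800

	// Sample info line: Size equals VRAM, so nothing spills to the CPU.
	fmt.Println(cpuSpill(4080218931, 4080218931)) // 0

	// Made-up partially offloaded model: 2 GB resident outside VRAM.
	fmt.Println(cpuSpill(8000000000, 6000000000)) // 2000000000
}
```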