Commit Graph

4 Commits

Author SHA1 Message Date
Jesse Gross
4f5999fd3f mlxrunner: schedule periodic snapshots during prefill
Add periodic snapshots every 8k tokens and near the end of the prompt
so that long prompts can be partially restored and thinking/generation
can be retried without full reprocessing.
2026-03-26 13:32:11 -07:00
Jesse Gross
ac5f0dbb6a mlxrunner: improve eviction and LRU tracking
Update LRU last used time just on the nodes that actually used
during processing rather than all snapshots along the path. This
allows eviction to remove nodes more accurately so we can avoid
other heuristics to auto-merge nodes.
2026-03-26 13:32:11 -07:00
Jesse Gross
77491439c2 mlxrunner: support partial match on pure transformer caches
Previously, a partial match within a node's edge would truncate the path
to the parent snapshot - effectively making all cache types behave as
recurrent caches. Caches with only transformer layers can rewind to
arbitrary boundary so this restores this capability to improve cache
hits
2026-03-23 17:44:19 -07:00
Jesse Gross
96e36c0d90 mlxrunner: share KV cache across conversations with common prefixes
Enable multiple conversations to reuse cached computations when they
share token prefixes (e.g. the same system prompt). A prefix trie
tracks shared regions so switching between conversations only
recomputes tokens that diverge. Inactive conversation state is paged
from active GPU memory to other memory and restored on demand, with LRU
eviction to keep memory usage bounded.
2026-03-18 16:06:33 -07:00