MLX runners (image generation and LLM) previously bypassed the
scheduler's standard load path via a separate loadMLX method. This meant
they skipped VRAM fitting checks and couldn't participate in model
eviction.
Now all model types flow through the same load function. For MLX, model
eviction is based on weight size only, because the KV cache and compute
graph are allocated dynamically. Eviction therefore does not account for
worst-case memory use, and models can still compete for memory, but this
is a significant improvement over bypassing the scheduler entirely.
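
As a rough illustration of weight-based eviction, here is a minimal sketch. The
loadedModel type, the pickEvictions helper, and the least-recently-used order
are all assumptions made for the example; they are not the scheduler's actual
types or policy.

```go
package main

import (
	"fmt"
	"sort"
)

// loadedModel is a hypothetical stand-in for a scheduler entry.
type loadedModel struct {
	name        string
	weightBytes uint64 // weights only; KV cache and compute graph are not counted
	lastUsed    int64  // larger means more recently used
}

// pickEvictions returns the models to unload, least recently used first,
// until need bytes of weights fit into free memory. ok is false when
// unloading every model still would not make enough room.
func pickEvictions(loaded []loadedModel, free, need uint64) (evict []string, ok bool) {
	if need <= free {
		return nil, true
	}
	// Sort a copy so the caller's slice is left untouched.
	candidates := append([]loadedModel(nil), loaded...)
	sort.Slice(candidates, func(i, j int) bool {
		return candidates[i].lastUsed < candidates[j].lastUsed
	})
	for _, m := range candidates {
		evict = append(evict, m.name)
		free += m.weightBytes
		if need <= free {
			return evict, true
		}
	}
	return evict, false
}

func main() {
	loaded := []loadedModel{
		{name: "llm-a", weightBytes: 8 << 30, lastUsed: 2},
		{name: "image-b", weightBytes: 6 << 30, lastUsed: 1},
	}
	evict, ok := pickEvictions(loaded, 2<<30, 7<<30)
	fmt.Println(evict, ok) // prints "[image-b] true"
}
```

Because only weights are counted, a model that "fits" here can still run short
of memory once its KV cache grows, which is the limitation noted above.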
* DRY out the runner lifecycle code
Now that discovery uses the runners as well, this unifies the runner spawning code
in a single place. This also unifies GPU discovery types with the newer ml.DeviceInfo.
* win: make incremental builds better
Place build artifacts in separate directories so incremental builds don't have to start fresh.
* Adjust sort order to consider iGPUs
* handle CPU inference OOM scenarios
* review comments
* test: harden scheduler tests
This removes reschedDelay, which was stale code, and adds a
configurable timeout for waitForVRAMRecovery. Tests can now
set a very short timeout so the scheduler does not get stuck
and hit a test timeout.
* test: tune tests for partial loads
Give stress tests more time when the model is split between CPU/GPU.
When `api/generate` builds up a message array and generates the prompt,
it now goes through the same function as `api/chat` for consistency.
That function is where the optional built-in renderers are hooked in to
bypass templates, a step that was missing for `api/generate` before
this change.
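
A minimal sketch of the shared path, assuming hypothetical names throughout
(message, renderer, renderPrompt, chatPrompt, generatePrompt); it only
illustrates the shape of the change, not the actual server code.

```go
package main

import (
	"fmt"
	"strings"
)

// message is a hypothetical chat message.
type message struct {
	role, content string
}

// renderer turns messages into a prompt string.
type renderer func(msgs []message) string

// renderPrompt is the single shared path: prefer the optional built-in
// renderer and fall back to the template otherwise.
func renderPrompt(msgs []message, builtin, tmpl renderer) string {
	if builtin != nil {
		return builtin(msgs)
	}
	return tmpl(msgs)
}

// chatPrompt renders the messages it was given.
func chatPrompt(msgs []message, builtin, tmpl renderer) string {
	return renderPrompt(msgs, builtin, tmpl)
}

// generatePrompt first builds a message array from the raw request fields,
// then goes through the same renderPrompt call as chatPrompt.
func generatePrompt(system, prompt string, builtin, tmpl renderer) string {
	var msgs []message
	if system != "" {
		msgs = append(msgs, message{role: "system", content: system})
	}
	msgs = append(msgs, message{role: "user", content: prompt})
	return renderPrompt(msgs, builtin, tmpl)
}

func main() {
	tmpl := func(msgs []message) string {
		var b strings.Builder
		for _, m := range msgs {
			fmt.Fprintf(&b, "<%s>%s", m.role, m.content)
		}
		return b.String()
	}
	fmt.Println(generatePrompt("be brief", "hi", nil, tmpl)) // prints "<system>be brief<user>hi"
}
```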
Closes: #12578