Commit Graph

12 Commits

Author SHA1 Message Date
Daniel Hiltgen
356c0b8e34 gemma4: add audio support with USM conformer encoder
Add audio encoding for Gemma 4 using the USM conformer architecture:
- Converter: audio tensor mapping, SSCP/conformer/embedder name replacements,
  softplus repacker for per_dim_scale, F32 enforcement for conv weights
- GGML backend: Conv1DDW and PadExt tensor ops
- Audio encoder: SSCP Conv2D, 12 conformer blocks (FFW + block-local
  attention with relative position embeddings + LightConv1d + FFW),
  output projection, audio-to-text embedding projector
- Audio preprocessing: WAV decode, mel spectrogram, FFT (pure Go)
- Model wiring: WAV detection, audio token handling, unified PostTokenize

Correctly transcribes "why is the sky blue" from test audio.
2026-04-01 15:24:17 -07:00
Daniel Hiltgen
ea3c6a3cbe gemma4: add Gemma 4 GGML model support
Add full Gemma 4 model family support (E2B, E4B, 26B MoE, 31B Dense)
for the GGML backend including text, vision, converter, parser, and
renderer.

Text model features:
- Sliding window + full attention with per-layer patterns
- KV sharing across layers with donor map
- Per-layer embeddings (PLE) with learned projections
- MoE routing with RMSNorm + learned scale
- Proportional RoPE with freq_factors for global attention
- Final logit softcapping

Vision model features:
- SigLIP vision encoder with 2D RoPE
- ClippableLinear with input/output clamping via packed v.clamp_data
- Adaptive average pooling with nMerge kernel
- Multi-modal projection with unweighted RMSNorm

Converter:
- Safetensors to GGUF with vision tensor renaming
- Fused MoE gate_up_proj splitting
- Vision patch embedding reshape (HF to Conv2D layout)
- Packed clamp data tensor for ClippableLinear bounds
- Proportional RoPE freq_factors generation

Also includes:
- BackendGet() on ml.Tensor for reading weight tensor data
- Q6_K CUDA get_rows kernel support
- MoE-aware ffn_down quantization layer counting
- Gemma4 parser with tool calling and thinking support
- Gemma4 renderer with structured tool format
- Architecture-based auto-detection of renderer/parser/stop tokens
- Integration test gemma4 model list additions
2026-04-01 15:23:10 -07:00
Jeffrey Morgan
da70c3222e model: support for qwen3.5 architecture (#14378) 2026-02-24 20:08:05 -08:00
Bruce MacDonald
9d902d63ce ggml: ensure tensor size is valid (#14406)
When quantizing tensors during model creation validate that the resulting sizes match what is expected based on the shape.
2026-02-24 21:52:44 -04:00
Jeffrey Morgan
77eb2ca619 model: add qwen3-next architecture (#14051) 2026-02-03 23:27:21 -08:00
Jeffrey Morgan
16750865d1 glm4moelite: quantize more tensors to q8_0 and avoid double BOS token (#13891) 2026-01-24 16:33:54 -08:00
Jeffrey Morgan
01cf7445f3 model: add lfm2 architecture and LFM2.5-1.2B-Thinking support (#13792)
Co-Authored-By: TommyBoiss <165361500+TommyBoiss@users.noreply.github.com>
2026-01-20 12:20:53 -08:00
Michael Yang
718961de68 migrate to golangci-lint v2 (#13109)
* migrate to golangci-lint v2
* copyloopvar
2025-11-18 11:00:26 -08:00
Michael Yang
d0b32def60 skip quantizing per_layer_token_embd (#11207)
this tensor isn't compatible with cuda when quantized to q4_K so skip it
2025-06-26 21:49:35 -07:00
Bruce MacDonald
fbe6ae285a server: improve tensor quantization fallback logic (#10806)
Fall back to alternative quantization types when a tensor's dimensions aren't divisible by the block size required for the original desired quantization type. If retried quantization types fail, the system ultimately falls back to F16 (half-precision floating point) which has a block size of 1 and can handle any tensor dimension.
2025-05-22 10:48:08 -07:00
Daniel Hiltgen
9d6df90805 Follow up to #10363 (#10647)
The quantization PR didn't block all unsupported file types,
which this PR fixes.  It also updates the API docs to reflect
the now reduced set of supported types.
2025-05-12 15:23:31 -07:00
Daniel Hiltgen
424810450f Move quantization to new backend (#10363)
* Move quantization logic to GGML via new backend

This moves the model aware logic to Go code and calls GGMLs quantization code for model creation.

* Remove "add model quantizations"

This is no longer needed now that quantization is implemented in Go+GGML code directly.
2025-05-06 11:20:48 -07:00