Mirror of https://github.com/ollama/ollama.git (synced 2026-04-17 15:53:27 +02:00)
gemma4: enable flash attention (#15378)
Backport GGML kernels so that flash attention can be enabled for the gemma 4 model on Metal and CUDA.
@@ -890,6 +890,7 @@ func (f GGML) FlashAttention() bool {
 	return slices.Contains([]string{
 		"bert",
 		"gemma3",
+		"gemma4",
 		"glm4moelite",
 		"glmocr",
 		"gptoss", "gpt-oss",
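The hunk extends a simple architecture allow-list: `FlashAttention()` reports support by checking the model's architecture string against a fixed slice. A minimal sketch of that pattern is below; the function name `supportsFlashAttention` and the standalone shape are illustrative assumptions, not Ollama's actual API (the real check is a method on `GGML`).

```go
package main

import (
	"fmt"
	"slices"
)

// supportsFlashAttention sketches the allow-list pattern from the hunk above:
// an architecture gets flash attention only if it appears in the list.
// Hypothetical free function; in Ollama this is a method on the GGML type.
func supportsFlashAttention(arch string) bool {
	return slices.Contains([]string{
		"bert",
		"gemma3",
		"gemma4", // the architecture this commit adds
		"glm4moelite",
		"glmocr",
		"gptoss", "gpt-oss",
	}, arch)
}

func main() {
	fmt.Println(supportsFlashAttention("gemma4")) // true after this change
	fmt.Println(supportsFlashAttention("mamba"))  // false: not in the list
}
```

`slices.Contains` (standard library since Go 1.21) keeps the check a one-liner, so enabling a new architecture is a one-line diff like the one above.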