#self-hosted + #llama.cpp - activescott's Notes on Ramblefeed

Sunday, April 12, 2026

Mac M1 vs M2 vs M3 vs M4 for Running LLMs - Real Tests - ML Journey

mljourney.com/mac-m1-vs-m2-vs-m3-vs-m4-for-running-llms-real-tests/

detailed benchmarks and info wrt apple silicon cpus with llama.

#8:02 PM

benchmarks self-hosted llama.cpp gpu llm

ggml-org/LlamaBarn: A cosy home for your LLMs.

github.com/ggml-org/LlamaBarn

LlamaBarn is a macOS menu bar app for running local LLMs.

#7:32 PM

self-hosted llama.cpp gpu llm

You'd probably have a lot better luck using Vulkan acceleration (not ROCm) of ll... | Hacker News

news.ycombinator.com/item?id=41631680

While Vulkan can be a good fallback, for LLM inference at least, the performance difference is not as insignificant as you believe. I just ran a test on the latest pull just to make sure this is still the case on llama.cpp HEAD, but text generation is +44% faster and prompt processing is +202% (~3X) faster with ROCm vs Vulkan.

Note: if you're building llama.cpp, all you have to do is swap GGML_HIPBLAS=1 and GGML_VULKAN=1 so the extra effort is just installing ROCm? (vs the Vulkan devtools)

ROCm:

CUDA_VISIBLE_DEVICES=1 ./llama-bench -m /models/gguf/llama-2-7b.Q4_0.gguf ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 ROCm devices: Device 0: Radeon RX 7900 XTX, compute capability 11.0, VMM: no | model | size | params | backend | ngl | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: | | llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | pp512 | 3258.67 ± 29.23 | | llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | tg128 | 103.31 ± 0.03 |

build: 31ac5834 (3818)

Vulkan:

GGML_VK_VISIBLE_DEVICES=1 ./llama-bench -m /models/gguf/llama-2-7b.Q4_0.gguf | model | size | params | backend | ngl | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: | ggml_vulkan: Found 1 Vulkan devices: Vulkan0: Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | pp512 | 1077.49 ± 2.00 | | llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | tg128 | 71.83 ± 0.06 |

build: 31ac583

#7:30 PM

self-hosted llama.cpp gpu llm

guide : using the new WebUI of llama.cpp · ggml-org/llama.cpp · Discussion #16938

github.com/ggml-org/llama.cpp/discussions/16938

This guide highlights the key features of the new SvelteKit-based WebUI of llama.cpp.

The new WebUI in combination with the advanced backend capabilities of the llama-server delivers the ultimate local AI chat experience. A few characteristics that set this project ahead of the alternatives:
Free, open-source and community-driven
Excellent performance on all hardware
Advanced context and prefix caching
Parallel and remote user support
Extremely lightweight and memory efficient
Vibrant and creative community
100% privacy

#7:08 PM

self-hosted llama.cpp llm