Mac M1 vs M2 vs M3 vs M4 for Running LLMs - Real Tests - ML Journey
Detailed benchmarks and info on Apple Silicon CPUs with llama.
Public notes from activescott tagged with both #gpu and #llama.cpp
LlamaBarn is a macOS menu bar app for running local LLMs.
While Vulkan can be a good fallback, for LLM inference at least, the performance difference is not as insignificant as you believe. I just ran a test on the latest pull to make sure this is still the case on llama.cpp HEAD: text generation is 44% faster and prompt processing is 202% faster (~3X) with ROCm than with Vulkan.
Note: if you're building llama.cpp, all you have to do is swap between GGML_HIPBLAS=1 and GGML_VULKAN=1, so the only extra effort is installing ROCm (vs the Vulkan devtools).
ROCm:
CUDA_VISIBLE_DEVICES=1 ./llama-bench -m /models/gguf/llama-2-7b.Q4_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon RX 7900 XTX, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | ROCm       |  99 |         pp512 |      3258.67 ± 29.23 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | ROCm       |  99 |         tg128 |        103.31 ± 0.03 |
build: 31ac5834 (3818)
Vulkan:
GGML_VK_VISIBLE_DEVICES=1 ./llama-bench -m /models/gguf/llama-2-7b.Q4_0.gguf
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         pp512 |       1077.49 ± 2.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         tg128 |         71.83 ± 0.06 |
build: 31ac583
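For anyone who wants to recheck the quoted percentages, here is a minimal Python sketch that recomputes them from the mean t/s values in the two llama-bench tables above. The names `ROCM`, `VULKAN`, `speedup`, and `compare` are made up for illustration and are not part of llama.cpp; the ± spread is ignored, so treat the result as a rough sanity check rather than a statistical comparison.

```python
# Mean tokens/second copied from the two llama-bench tables above
# (Radeon RX 7900 XTX, llama 7B Q4_0, ngl 99). The ± spread is ignored.
ROCM = {"pp512": 3258.67, "tg128": 103.31}
VULKAN = {"pp512": 1077.49, "tg128": 71.83}


def speedup(fast: float, slow: float) -> tuple[float, float]:
    """Return (ratio, percent_faster) of `fast` relative to `slow`."""
    ratio = fast / slow
    return ratio, (ratio - 1.0) * 100.0


def compare(a: dict[str, float], b: dict[str, float]) -> dict[str, tuple[float, float]]:
    """Compute the speedup of backend `a` over backend `b` for each shared test."""
    return {test: speedup(a[test], b[test]) for test in a if test in b}


if __name__ == "__main__":
    for test, (ratio, pct) in compare(ROCM, VULKAN).items():
        print(f"{test}: ROCm is {ratio:.2f}x Vulkan ({pct:+.0f}%)")
```

pp512 is the prompt-processing test and tg128 is the text-generation test, so the two printed lines correspond to the ~3X and +44% figures quoted in the comment.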