Mac M1 vs M2 vs M3 vs M4 for Running LLMs - Real Tests - ML Journey
detailed benchmarks and info on apple silicon cpus with llama.cpp.
LlamaBarn is a macOS menu bar app for running local LLMs.
While Vulkan can be a good fallback, for LLM inference at least the performance difference is more significant than you might think. I just ran a test on the latest pull to make sure this is still the case on llama.cpp HEAD: text generation is +44% faster and prompt processing is +202% (~3X) faster with ROCm vs Vulkan.
Note: if you're building llama.cpp, all you have to do is swap GGML_HIPBLAS=1 for GGML_VULKAN=1, so the extra effort is just installing ROCm (vs the Vulkan devtools)?
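A minimal sketch of the two builds (the comment above uses the Makefile-style flags; these are the CMake equivalents from around that build, 31ac583; newer llama.cpp trees have since renamed GGML_HIPBLAS to GGML_HIP):

# ROCm (HIP) build; assumes ROCm itself is already installed
cmake -B build-rocm -DGGML_HIPBLAS=ON
cmake --build build-rocm --config Release -j

# Vulkan build of the same tree; assumes the Vulkan SDK/devtools are installed
cmake -B build-vulkan -DGGML_VULKAN=ON
cmake --build build-vulkan --config Release -j

# llama-bench ends up in build-*/bin/, then run it as below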
ROCm:
CUDA_VISIBLE_DEVICES=1 ./llama-bench -m /models/gguf/llama-2-7b.Q4_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon RX 7900 XTX, compute capability 11.0, VMM: no
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | pp512 | 3258.67 ± 29.23 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | tg128 | 103.31 ± 0.03 |
build: 31ac5834 (3818)
Vulkan:
GGML_VK_VISIBLE_DEVICES=1 ./llama-bench -m /models/gguf/llama-2-7b.Q4_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | pp512 | 1077.49 ± 2.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | tg128 | 71.83 ± 0.06 |
build: 31ac583
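Sanity check on the headline numbers from the two tables: 3258.67 / 1077.49 ≈ 3.02× on pp512 (+202%) and 103.31 / 71.83 ≈ 1.44× on tg128 (+44%), which matches the claim above.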
lots of gpu hosting options.
great gpu benchmarking suite and list of benchmarks on lots of gpus. predates RTX 50 series and hasn't been updated in 2yrs. contains apple silicon too.
“CPUs are becoming the bottleneck in terms of growing out this AI and agentic workflow,” Dion Harris, Nvidia’s head of AI infrastructure, told CNBC this week, calling it an “exciting opportunity.”
The chip giant announced its first data center CPU, Grace, in 2021, and the next generation, Vera, is now in production. The CPUs are typically deployed alongside Nvidia’s famous Hopper, Blackwell or Rubin GPUs in full rack-scale systems.
Exploding demand for GPUs has turned Nvidia into a household name and the most valuable publicly traded company in the world, with a $4.4 trillion market cap. Its broader chip strategy took a major turn in February, when Nvidia struck a multiyear deal with Meta that included the first large-scale deployment of Grace CPUs on their own, with plans to deploy Vera in 2027.
Thousands of standalone Nvidia CPUs are also helping power supercomputers at the Texas Advanced Computing Center and Los Alamos National Lab, Nvidia told CNBC.
Bank of America predicts the CPU market could more than double, from $27 billion in 2025 to $60 billion by 2030. In the latest quarter alone, Nvidia generated data center revenue of over $62 billion, up 75% from a year earlier.
Harris said Nvidia took a fundamentally different approach in design that makes its CPUs “best suited” for data processing and agentic AI workflows, compared to the more general-purpose CPUs made by industry leaders Intel and AMD.
A big difference is in the number of cores in each CPU.
AMD’s EPYC line and Intel’s Xeon high-performance server CPUs typically have 128 cores, compared to 72 cores in Nvidia’s Grace CPU.
“If you’re a hyperscaler, you want to maximize the number of cores per CPU, and that essentially drives down the cost, the dollars per core. So that’s one business model,” Harris explained.
Instead, Nvidia designed its CPU specifically to help its star GPUs run AI workloads.
“Your single-threaded performance becomes much more important than your dollars per core because you’re trying to make sure that that very expensive resource, being the GPU, isn’t sitting there waiting,” Harris said.
Nvidia also bases its CPUs on Arm architecture, more typically used for chips in lower-power devices like smartphones, while Intel and AMD base their CPUs on traditional x86 architecture. Introduced by Intel nearly 50 years ago, x86 is the leading instruction set that has dominated PC and server processor designs since its inception.
AMD’s Forrest Norrod said Nvidia has “optimized their chips very well, I think, for feeding their GPUs. They’re not well optimized for general-purpose applications.”
FurMark 2 is the successor to the venerable FurMark 1 and is a very intensive GPU stress test for Windows (32-bit and 64-bit) and Linux (32-bit and 64-bit) platforms. It's also a quick OpenGL and Vulkan graphics benchmark with online scores. FurMark 2 has improved command-line support and is built with GeeXLab.
Amazon is letting sellers post listings for ~$3K graphics cards, and then the seller ships buyers fanny packs instead. Confirmed multiple times and in reviews.
Almost too tempting...
EDIT: 🤣🤣 Look at seller reviews: https://files.catbox.moe/wg1oqi.png
DON'T DO IT.
Per a previous post, I actually tried it due to the FBA return policy, cause I was putting it on an empty amz card, and cause I was bored. 🥱😏
You get a fanny pack. I recorded the shit out of opening the package and reported it as fraud. A few people were doing the exact same thing (ordering it in case it was real, and recording the opening of the package to CYA) cause we were also bored 😂. Out of the norm, you have to take detailed pics, and the charge stays on your card until they receive the fanny pack (yes, you have to send it back 😂🤣) and ‘inspect it’, I guess due to the massive number of returns and fraud reports, plus the normal refund period on top of that. Not even a credit to your account would be instant once you return it at a drop off. 😕 Clearly a waste of time, but it will mess you up seriously if you use a debit card, or if there is any sort of delay that results in them receiving it late and taking too long to process it.
Amazon will not make you whole on this. Maybe they give you a shitty coupon (some people reported a $10 credit 😒) and it’s not worth your time.
https://imgur.com/a/dHlrKnh
https://imgur.com/a/tYKVyHz
You'd think Amazon would suspend the seller or remove the listing the moment 20+ people all reported it and returned it for the same reason, instead of removing the reviews and letting people get scammed (and then dragging their feet getting the defrauded people their money back).
I just received my AORUS RTX 5090 from Amazon, sold and shipped directly by Amazon as brand new. When I opened the box, it was clearly an open-box item and contained only a PCB with no GPU chip or VRAM installed. How does Amazon ship something like this as new?
I saw that the other day. Was tempted, but then when I checked the store all the reviews were 1 star, saying that they got a fanny pack instead of what they ordered. Amazon, being Amazon, had removed or crossed the reviews out because it was fulfilled by them, not the store, basically allowing them to defraud people.
One day the mighty data centre could be toppled into obsolescence by the humble smartphone, said Perplexity CEO Aravind Srinivas on a recent podcast.
Apple's AI system, Apple Intelligence, already runs some features on specialised chips inside the firm's latest range of products.
Microsoft's Copilot+ laptops also include on-device AI processing.
A few years ago I heard about a tiny data centre, the size of a washing machine, being operated in Devon, UK. Beyond its computing power, the heat it released was warming a public swimming pool.
He thinks every public building should instead house a small data centre, working in a large network with each other where required, and providing heating as a by-product.
The market watcher estimates memory prices surged 40%-50% in the final quarter of 2025, and expects similar gains in the first quarter of 2026, followed by a roughly 20% rise in the second quarter.
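Taking midpoints, that would compound to roughly 1.45 × 1.45 × 1.20 ≈ 2.5×, i.e. memory prices around two and a half times higher by mid-2026 than at the start of Q4 2025, if the forecast holds.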
NVTOP stands for Neat Videocard TOP, a (h)top like task monitor for GPUs and accelerators. It can handle multiple GPUs and print information about them in a htop-familiar way.
Currently supported vendors are AMD (Linux amdgpu driver), Apple (limited M1 & M2 support), Huawei (Ascend), Intel (Linux i915/Xe drivers), NVIDIA (Linux proprietary drivers), Qualcomm Adreno (Linux MSM driver), Broadcom VideoCore (Linux v3d driver).
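To try it, a minimal sketch assuming a Debian/Ubuntu box where nvtop is packaged (build from source on other setups):

# install from the distro repos
sudo apt install nvtop

# launch; supported GPUs are auto-detected, press q to quit
nvtop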
Run, manage, and scale AI workloads on any AI infrastructure. Use one system to access & manage all AI compute (Kubernetes, 20+ clouds, or on-prem).
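A rough sketch of the workflow (task.yaml and train.py are illustrative names, not from the note; assumes the sky CLI is installed and at least one cloud is configured):

# a minimal SkyPilot task file; train.py is a hypothetical job script
cat > task.yaml <<'EOF'
resources:
  accelerators: A100:1   # request one A100 wherever it's available
run: |
  python train.py
EOF

# provision a VM, sync the working directory, and run the task
sky launch task.yaml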
This repo has an assembler and disassembler for many Nvidia GPUs. It also includes some interesting docs on the low-level aspects and many identifiers for the GPUs.