#hardware

Public notes from activescott tagged with #hardware

Friday, May 29, 2026

Wednesday, May 20, 2026

The net result is a chip with a lot of compute and a lot of SRAM that is blisteringly fast to access. To put it in numbers, the WSE-3 (Cerebras’ latest chip) has 44GB of on-chip SRAM at 21 PB/s of bandwidth; an H100 has 80GB of HBM at 3.35 TB/s. In other words, the WSE-3 has just over half the memory of an H100, but 6,000 times the memory bandwidth.
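The comparison above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the figures quoted in the paragraph; the function names are illustrative, and the "full sweep" time (reading the whole memory once at peak bandwidth) is an added back-of-the-envelope measure, not something the source computes.

```python
# Figures quoted in the note above, in decimal units.
WSE3_SRAM_GB = 44
WSE3_BW_TBPS = 21_000  # 21 PB/s expressed in TB/s
H100_HBM_GB = 80
H100_BW_TBPS = 3.35


def capacity_ratio() -> float:
    """WSE-3 memory capacity as a fraction of H100 capacity."""
    return WSE3_SRAM_GB / H100_HBM_GB


def bandwidth_ratio() -> float:
    """WSE-3 memory bandwidth as a multiple of H100 bandwidth."""
    return WSE3_BW_TBPS / H100_BW_TBPS


def full_sweep_seconds(capacity_gb: float, bandwidth_tbps: float) -> float:
    """Time to read the entire memory once at peak bandwidth."""
    return (capacity_gb / 1000) / bandwidth_tbps


if __name__ == "__main__":
    print(f"capacity ratio:  {capacity_ratio():.2f}")
    print(f"bandwidth ratio: {bandwidth_ratio():,.0f}x")
    print(f"WSE-3 sweep: {full_sweep_seconds(WSE3_SRAM_GB, WSE3_BW_TBPS) * 1e6:.1f} microseconds")
    print(f"H100 sweep:  {full_sweep_seconds(H100_HBM_GB, H100_BW_TBPS) * 1e3:.1f} milliseconds")
```

The sweep time matters because single-stream token generation is typically limited by how quickly weights can be read from memory, so a shorter sweep loosely bounds how fast tokens can be produced.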

The reason to compare the WSE-3 to an H100 is that the H100 is the chip most used for inference, and inference is clearly what Cerebras is best suited for. You can use Cerebras chips for training, but the chip-to-chip networking story isn’t very compelling, which is to say that all of that compute and on-chip memory is mostly just sitting around; what is much more interesting is the idea of getting a stream of tokens at dramatically faster speed than you can from a GPU.

Note, however, that the limitation on training potentially applies to inference as well: as long as everything fits in on-chip memory, Cerebras’ speed is an incredible experience; the moment you need more memory, whether for a larger model or, more likely, a larger KV cache, Cerebras doesn’t make much sense, particularly given the price.
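To make the "fits in on-chip memory" condition concrete, here is a rough sketch of the standard KV cache size estimate for a transformer (keys plus values, per layer, per KV head, per token). The model shape used in the example (32 layers, 8 KV heads, head dimension 128, 16-bit values, 16 GB of weights) is a hypothetical configuration chosen for illustration, not a figure from the source; only the 44 GB capacity comes from the note above.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2, batch: int = 1) -> int:
    """Estimate KV cache size: 2 tensors (K and V) per layer, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens * batch


def fits_on_chip(weight_bytes: int, cache_bytes: int,
                 capacity_bytes: int = 44 * 10**9) -> bool:
    """True if weights plus KV cache fit in the quoted 44 GB of SRAM."""
    return weight_bytes + cache_bytes <= capacity_bytes


if __name__ == "__main__":
    # Hypothetical model: 16 GB of 16-bit weights (an assumption).
    weights = 16 * 10**9
    shape = dict(layers=32, kv_heads=8, head_dim=128)

    for tokens in (8_192, 262_144):
        cache = kv_cache_bytes(tokens, **shape)
        print(tokens, cache, fits_on_chip(weights, cache))
```

Under these assumed numbers the short context fits comfortably while the long one does not, which is the cliff the paragraph describes: the cache grows linearly with context length and batch size, while on-chip capacity is fixed.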

At the same time, I do think there will be a market for Cerebras-style chips: right now the company is highlighting the usefulness of speed for coding — reasoning means a lot of tokens, which means that dramatically scaling up tokens-per-second equals faster thinking — but I think this is a temporary use case, for reasons I’ll explain in a bit. What does matter is how long humans are waiting for an answer, and as products like AI wearables become more of a thing, the speed of interaction, particularly for voice — which will be a function of token generation speed — will have a tangible effect on the user experience.

All of this falls under the banner of “inference”, but I think it will be increasingly clear that there is a difference between providing an answer — what I will call “answer inference” — and doing a task — what I will call “agentic inference.” Cerebras’ target market is “answer inference”; in the long run, I think the architecture for “agentic inference” will look a lot different, not just from Cerebras’ approach, but from the GPU approach as well.

I mentioned above that fast inference for coding is a temporary use case. Specifically, coding with LLMs requires a human in the loop. It’s the human that defines what is to be coded, checks the work, commits the pull request, etc.; it’s not hard to envision a future, however, where all of this is completely handled by machines. This will apply to agentic work broadly: the true power of agents will not be that they do work for humans, but rather that they do work without human involvement at all.

This, by extension, will mean that the likely best approach to solving agentic inference will look a lot different than answer inference. The most important aspect for answer inference is token speed; the most important aspect for agentic inference, however, is memory. Agents need context, state, and history. Some of that will live as active KV cache; some will live in host memory or SSDs; much of it will live in databases, logs, embeddings, and object stores. The important point is that agentic inference will be less about GPUs answering a question and more about the memory hierarchy wrapped around a model.
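One way to picture a memory hierarchy wrapped around a model is a chain of stores ordered from fastest and smallest to slowest and largest. The following is a toy sketch only: the class names, tier names, and capacities are all hypothetical, and the least-recently-used demotion policy is an assumption made for illustration rather than anything the source prescribes.

```python
from collections import OrderedDict


class Tier:
    """One level of the hierarchy, holding at most `capacity` items."""

    def __init__(self, name: str, capacity: int):
        self.name = name
        self.capacity = capacity
        self.items = OrderedDict()  # oldest entry first


class TieredStore:
    """Looks up context fastest-tier-first and promotes items on access."""

    def __init__(self, tiers):
        self.tiers = tiers  # ordered fastest to slowest

    def put(self, key, value, level: int = 0):
        tier = self.tiers[level]
        tier.items[key] = value
        tier.items.move_to_end(key)  # mark as most recently used
        if len(tier.items) > tier.capacity:
            # Demote the least recently used item to the next tier down.
            old_key, old_value = tier.items.popitem(last=False)
            if level + 1 < len(self.tiers):
                self.put(old_key, old_value, level + 1)
            # Items evicted from the last tier are simply dropped.

    def get(self, key):
        """Return (value, name of the tier it was found in), or (None, None)."""
        for tier in self.tiers:
            if key in tier.items:
                value = tier.items.pop(key)
                self.put(key, value, 0)  # promote to the fastest tier
                return value, tier.name
        return None, None


if __name__ == "__main__":
    store = TieredStore([
        Tier("kv_cache", 2),
        Tier("host_memory", 2),
        Tier("object_store", 100),
    ])
    for key in ("a", "b", "c"):
        store.put(key, key.upper())
    print(store.get("a"))  # found one tier down, then promoted
    print(store.get("a"))  # now served from the fastest tier
```

The trade-off discussed in the notes shows up directly: a hit in a lower tier still succeeds, it just costs more time, which matters far less when no human is waiting on the result.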

Critically, this articulation of an agentic-specific memory hierarchy implies a necessary trade-off of speed for capacity. Here’s the thing, though: lower speed isn’t nearly as important a consideration if there isn’t a human in the loop. If an agent is waiting around for a job that is being run overnight, the agent doesn’t know or care about the user experience impact; what is most important is being able to accomplish a task, and if entirely new approaches to memory make that possible, then delays are fine.

Meanwhile, if delays are fine, then all of the focus on pure compute power and high-bandwidth memory seems out of place: if latency isn’t the top priority, then slower and cheaper memory, like traditional DRAM, makes a lot more sense. And if the entire system is mostly waiting on memory, then chips don’t need to be as fast as the cutting edge either. This represents a profound shift in future architectures, but it also doesn’t mean that current architectures are going away.

Monday, April 13, 2026

Wednesday, March 18, 2026

Monday, March 16, 2026

“CPUs are becoming the bottleneck in terms of growing out this AI and agentic workflow,” Dion Harris, Nvidia’s head of AI infrastructure, told CNBC this week, calling it an “exciting opportunity.”

The chip giant announced its first data center CPU, Grace, in 2021, and the next generation, Vera, is now in production. The CPUs are typically deployed alongside Nvidia’s famous Hopper, Blackwell or Rubin GPUs in full rack-scale systems.

Exploding demand for GPUs has turned Nvidia into a household name and the most valuable publicly traded company in the world, with a $4.4 trillion market cap. Its broader chip strategy took a major turn in February, when Nvidia struck a multiyear deal with Meta that included the first large-scale deployment of Grace CPUs on their own, with plans to deploy Vera in 2027.

Thousands of standalone Nvidia CPUs are also helping power supercomputers at the Texas Advanced Computing Center and Los Alamos National Lab, Nvidia told CNBC.

Bank of America predicts the CPU market could more than double, from $27 billion in 2025 to $60 billion by 2030. In the latest quarter alone, Nvidia generated data center revenue of over $62 billion, up 75% from a year earlier.
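For scale, the forecast above implies a compound annual growth rate that can be worked out directly. This is a small sketch using only the two quoted endpoints ($27 billion in 2025, $60 billion by 2030); treating the span as exactly five compounding years is an assumption, and the function name is illustrative.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values over `years` years."""
    return (end_value / start_value) ** (1 / years) - 1


if __name__ == "__main__":
    rate = cagr(27, 60, 2030 - 2025)
    print(f"implied CPU market CAGR: {rate:.1%}")
```

Under that assumption the implied rate is a little over 17% per year.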

Harris said Nvidia took a fundamentally different approach in design that makes its CPUs “best suited” for data processing and agentic AI workflows, compared to the more general-purpose CPUs made by industry leaders Intel and AMD.

A big difference is in the number of cores in each CPU.

AMD’s EPYC line and Intel’s Xeon high-performance server CPUs typically have 128 cores, compared to 72 cores in Nvidia’s Grace CPU.

“If you’re a hyperscaler, you want to maximize the number of cores per CPU, and that essentially drives down the cost, the dollars per core. So that’s one business model,” Harris explained.

Instead, Nvidia designed its CPU specifically to help its star GPUs run AI workloads.

“Your single-threaded performance becomes much more important than your dollars per core because you’re trying to make sure that that very expensive resource, being the GPU, isn’t sitting there waiting,” Harris said.

Nvidia also bases its CPUs on Arm architecture, more typically used for chips in lower-power devices like smartphones, while Intel and AMD base their CPUs on traditional x86 architecture. Introduced by Intel nearly 50 years ago, x86 is the leading instruction set that has dominated PC and server processor designs since its inception.

AMD’s Norrod said Nvidia has “optimized their chips very well, I think, for feeding their GPUs. They’re not well optimized for general-purpose applications.”

Monday, March 9, 2026

Saturday, March 7, 2026

Wednesday, February 18, 2026

Sunday, February 1, 2026

FurMark 2 is the successor to the venerable FurMark 1 and is a very intensive GPU stress test for Windows (32-bit and 64-bit) and Linux (32-bit and 64-bit) platforms. It's also a quick OpenGL and Vulkan graphics benchmark with online scores. FurMark 2 has improved command-line support and is built with GeeXLab.

Saturday, January 10, 2026

Tuesday, December 23, 2025

Saturday, November 15, 2025