Reiner Pope – Chip design from the bottom up

Dwarkesh Podcast · 1h 20m · May 22, 2026

AI-Generated Summary

In this deep dive into AI chip design, Reiner Pope, CEO of MatX, walks through the inner workings of AI accelerators from the ground up, starting with the most fundamental building blocks: logic gates and wires. He shows how a 4-bit multiply-accumulate operation, the core of neural network computation, is implemented using AND gates and full adders, and argues that the real cost lies not in the computation itself but in the data movement required to feed it. This leads to systolic arrays, which bake matrix multiplication directly into hardware and cut communication overhead by storing weights locally and streaming inputs in slowly. Pope explains how the shift from individual CUDA cores to large, fixed-function systolic arrays has driven the efficiency gains in modern AI chips such as NVIDIA’s Tensor Cores and Google’s TPUs. He contrasts GPU and TPU architectures, arguing that the GPU’s tiled, fine-grained design limits systolic array size, while the TPU’s coarser, unified structure enables larger, more efficient arrays. The discussion then turns to clock cycles, pipeline registers, and the trade-offs between speed, area, and throughput, ending on the point that the most efficient chips are not just faster: they are designed to minimize the cost of moving data, not only the cost of computing on it.
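To make the gate-level description concrete, here is a minimal Python sketch of a 4-bit unsigned multiply-accumulate built only from AND gates (for the partial products) and full adders (for the sums). The 12-bit accumulator width, the ripple-carry adder layout, and all function names are assumptions chosen for illustration; the episode may describe a different arrangement.

```python
def full_adder(a, b, cin):
    """One-bit full adder from XOR/AND/OR gates; returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def to_bits(x, width):
    """Little-endian list of bits (index 0 is the least significant bit)."""
    return [(x >> i) & 1 for i in range(width)]

def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))

def ripple_add(x_bits, y_bits):
    """Add two equal-width bit vectors with a chain of full adders.
    The final carry is dropped, so the result wraps modulo 2**width."""
    out, carry = [], 0
    for a, b in zip(x_bits, y_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out

def mac4(a, b, acc, acc_width=12):
    """Compute acc + a*b for 4-bit unsigned a and b (acc_width is an assumption)."""
    a_bits, b_bits = to_bits(a, 4), to_bits(b, 4)
    total = to_bits(acc, acc_width)
    for j in range(4):
        # Partial-product row j: four AND gates, shifted left by j positions.
        row = [0] * j + [a_bits[i] & b_bits[j] for i in range(4)]
        row += [0] * (acc_width - len(row))
        total = ripple_add(total, row)
    return from_bits(total)
```

For example, `mac4(3, 5, 10)` accumulates 3 × 5 onto 10. The sketch uses 16 AND gates for the partial products and four rows of full adders for the accumulation.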

Key Takeaways
1. The fundamental primitive in AI chips is multiply-accumulate, not simple multiplication, because it matches the structure of matrix multiplication in neural networks.

2. Data movement costs, especially between register files and logic units, can account for roughly seven-eighths of the total cost (about 7x the cost of the computation itself), making data movement the primary bottleneck.

3. Systolic arrays solve the data movement problem by storing weights locally and streaming inputs slowly, reducing communication bandwidth by a factor of the matrix size.

4. The clock cycle is determined by the slowest path in the chip’s logic; adding pipeline registers can increase speed but at the cost of area and throughput.

5. FPGAs are 10x more expensive than ASICs not because of complexity, but because they use lookup tables (LUTs) that require 32 gates to implement a single AND gate.
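The lookup-table point in the last takeaway can be illustrated with a small sketch. The summary does not say where the figure of 32 comes from; one plausible reading, assumed here, is a 5-input LUT whose truth table holds 2^5 = 32 stored entries. The function names are hypothetical.

```python
def make_lut(func, n_inputs):
    """Build the truth table a LUT would store: one entry per input combination."""
    table = []
    for index in range(2 ** n_inputs):
        bits = [(index >> i) & 1 for i in range(n_inputs)]
        table.append(func(bits))
    return table

def lut_eval(table, bits):
    """Evaluate the LUT by using the input bits as an address into the table."""
    index = sum(bit << i for i, bit in enumerate(bits))
    return table[index]

# A 2-input AND mapped onto an assumed 5-input LUT: three inputs go unused,
# yet all 32 table entries still exist in hardware.
and_in_lut5 = make_lut(lambda bits: bits[0] & bits[1], 5)
```

Under this assumption, the programmable fabric pays for 32 storage entries plus the selection logic, where an ASIC would place a single AND gate.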

Chapters
0:00 (10 min)

The Building Blocks of AI Chips: Logic Gates and Wires

The main function that AI chips want to compute is multiplication of matrices, and really inside that, the fundamental primitive is multiply-accumulate, just of pairs of numbers.
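The relationship between matrix multiplication and multiply-accumulate can be sketched in a few lines of Python. This is an illustrative software version, not a description of any particular chip; the names `mac` and `matmul` are chosen here for clarity.

```python
def mac(acc, a, b):
    """The multiply-accumulate primitive: acc + a * b."""
    return acc + a * b

def matmul(A, B):
    """Multiply matrix A (n x k) by matrix B (k x m) using only mac()."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0
            for t in range(k):
                # Each output element is a chain of k multiply-accumulates.
                acc = mac(acc, A[i][t], B[t][j])
            C[i][j] = acc
    return C
```

Every output element is produced by a chain of `k` calls to `mac`, which is why hardware that makes this one primitive cheap makes the whole matrix product cheap.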

10:00 (10 min)

Why Multiply-Accumulate? The Math Behind the Choice

As you are summing up this number, you are summing up a whole bunch of numbers. And so you've got a lot of rounding errors accumulating. Whereas in this case, there's only one multiplication in that chain. And so there's not a lot of rounding errors accumulating in the multiplication.
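The rounding argument can be illustrated with a toy numeric format. The sketch below assumes a made-up format with 8 bits of mantissa (`round_to_bits` is a hypothetical helper, not a real hardware format) and compares rounding the running sum at every step against rounding only each product.

```python
import math

def round_to_bits(x, bits):
    """Round x to `bits` bits of mantissa (a toy model of a narrow number format)."""
    if x == 0.0:
        return 0.0
    mantissa, exponent = math.frexp(x)  # x = mantissa * 2**exponent, 0.5 <= |mantissa| < 1
    scale = 1 << bits
    return math.ldexp(round(mantissa * scale) / scale, exponent)

def dot_narrow_accumulator(a, b, bits=8):
    """Round the running sum after every step: errors pile up along the chain."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_to_bits(acc + x * y, bits)
    return acc

def dot_wide_accumulator(a, b, bits=8):
    """Round each product once, but keep the running sum in full precision."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += round_to_bits(x * y, bits)
    return acc

a = [0.1] * 1000
b = [0.1] * 1000
exact = 10.0
narrow_error = abs(dot_narrow_accumulator(a, b) - exact)
wide_error = abs(dot_wide_accumulator(a, b) - exact)
```

With these inputs the narrow accumulator stalls far short of the exact sum, because each new product eventually falls below half of its rounding step. Each product is rounded only once, so its error stays small. This is one way to read the quote: precision matters more in the long accumulation chain than in the single multiplication.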

20:00 (10 min)

The Hidden Cost: Data Movement in Register Files

Almost all of the cost, like 7/8ths of the cost, is in reading and writing the register file. And only a tiny fraction of the cost is in the logic unit itself.
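The arithmetic behind the 7/8ths figure is worth spelling out. The sketch below takes the quoted split at face value and adds one hypothetical extension: what the register-file share would become if a single register-file access could feed several logic operations. The function name and the amortization model are assumptions for illustration, not figures from the episode.

```python
from fractions import Fraction

register_file_share = Fraction(7, 8)       # quoted share of total cost
logic_share = 1 - register_file_share      # remaining 1/8 for the logic unit
overhead_ratio = register_file_share / logic_share  # register-file cost per unit of logic cost

def register_file_fraction(ops_per_access):
    """Hypothetical model: 7 cost units of register-file traffic are shared
    by `ops_per_access` logic operations costing 1 unit each."""
    register_cost = Fraction(7)
    logic_cost = Fraction(ops_per_access)
    return register_cost / (register_cost + logic_cost)
```

With one operation per access the model reproduces the quoted 7/8 share; with seven operations per access the share drops to one half. That is the motivation for architectures that reuse each fetched value many times.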

30:00 (15 min)

The Systolic Array Revolution: Baking Computation into Hardware

We want to have quadratically more compute. We do, we have. We've got sort of X times Y as much compute as we had before. But we're going to want to somehow aim for having only X times as much communication.
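The X times Y compute versus X communication goal can be sketched by counting operations in one step of a weight-stationary array. This is a functional model under simplifying assumptions (weights already loaded, no cycle-level timing, hypothetical function name), not a cycle-accurate systolic array.

```python
def weight_stationary_step(weights, x):
    """One step of an X-by-Y array whose weights stay in place.

    weights: X rows by Y columns, assumed to be already stored in the array.
    x: a length-X input vector streamed in for this step.
    Returns (outputs, macs_performed, values_streamed_in).
    """
    rows, cols = len(weights), len(weights[0])
    outputs = [0] * cols
    macs = 0
    for j in range(cols):
        for i in range(rows):
            outputs[j] += x[i] * weights[i][j]  # one multiply-accumulate per cell
            macs += 1
    values_streamed_in = len(x)  # only X new values enter per step
    return outputs, macs, values_streamed_in

outputs, macs, streamed = weight_stationary_step([[1, 2, 3], [4, 5, 6]], [1, 1])
```

Here X = 2 and Y = 3: six multiply-accumulates are performed while only two new values are streamed in, so each streamed value is reused Y times.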

45:00 (15 min)

GPU vs. TPU: Architectural Trade-Offs

Pope contrasts GPU and TPU architectures, showing that GPUs use many small, tiled units (SMs) with their own register files and schedulers, limiting systolic array size. TPUs use fewer, larger units with a central vector unit, enabling bigger, more efficient arrays and better amortization of fixed costs.
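One way to see why array size matters is to compare edge traffic for the same total number of multiply-accumulate cells. The sketch assumes square arrays with inputs entering along one edge and outputs leaving along another; the sizes below are hypothetical and do not describe any specific GPU or TPU.

```python
def array_stats(side, count):
    """Return (total MAC cells, values crossing array edges per step)
    for `count` square arrays of `side` x `side` cells."""
    mac_cells = count * side * side
    edge_values = count * 2 * side  # `side` inputs in plus `side` outputs out, per array
    return mac_cells, edge_values

many_small = array_stats(side=32, count=16)   # sixteen small arrays
one_large = array_stats(side=128, count=1)    # one large array with the same cell count
```

Both configurations contain the same number of cells, but under these assumptions the single large array moves a quarter as many values across its edges per step. This matches the summary's claim that larger arrays amortize fixed costs better.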

High-Impact Quotes
We've talked publicly about something which we call a splittable systolic array, which in some sense you can think of as big systolic arrays that can be small systolic arrays as well.
Reiner Pope, 80:15 (viral score: 90.0)

Almost all of the cost, like 7/8ths of the cost, is in reading and writing the register file. And only a tiny fraction of the cost is in the logic unit itself.
Reiner Pope, 25:21 (viral score: 88.0)

The main function that AI chips want to compute is multiplication of matrices, and really inside that, the fundamental primitive is multiply-accumulate, just of pairs of numbers.
Reiner Pope, 0:56 (viral score: 85.0)
Speakers

Host

Dwarkesh Patel

Guest

Reiner Pope
Topics Discussed

ai chip design: 95%
systolic array: 92%
multiply accumulate: 90%
data movement cost: 88%
gpu vs tpu architecture: 85%
fpga vs asic: 83%
clock cycle optimization: 80%
deterministic latency: 78%
People & Brands

Reiner Pope (person): 12 mentions, positive
MatX (organization): 8 mentions, positive
TPU (other): 7 mentions, positive
FPGA (other): 6 mentions, neutral
NVIDIA (organization): 6 mentions, neutral
Tensor Cores (other): 5 mentions, positive
TSMC (organization): 3 mentions, neutral
Jane Street (organization): 2 mentions, neutral
Cursor (organization): 2 mentions, positive
Crusoe (organization): 2 mentions, positive
