Inside NVIDIA GPUs: Anatomy of high performance matmul kernels

NVIDIA Hopper packs serious architectural tricks. At the core: **Tensor Memory Accelerator (TMA)**, **tensor cores**, and **swizzling**, the trio behind async, cache-friendly matmul kernels that flirt with peak throughput. But folks aren't stopping at cuBLAS. They're stacking new tactics: **warp-group MMA**, SMEM pipelining, **Hilbert curve scheduling**, and **cluster-wide data reuse**. All in plain CUDA C++ with a dusting of inline PTX. No magic libraries, just smart scheduling and brutal efficiency.
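To make the SMEM pipelining idea concrete, here is a minimal sketch (not taken from the article) of a double-buffered shared-memory pipeline built on CUDA's async-copy intrinsics. The kernel name, the `TILE` size, and the trivial "compute" step are illustrative placeholders; a real matmul kernel would issue tensor-core MMA instructions where the multiply-by-two stands.

```cuda
// Minimal double-buffered SMEM pipeline sketch (illustrative, not the
// article's code). Requires Ampere+ for true async copies; older GPUs
// fall back to synchronous copies.
#include <cuda_pipeline.h>  // __pipeline_memcpy_async and friends

constexpr int TILE = 128;   // assumed tile width: one float per thread

__global__ void pipelined_copy_kernel(const float* __restrict__ in,
                                      float* __restrict__ out,
                                      int n_tiles) {
    // Two shared-memory buffers: one being filled, one being consumed.
    __shared__ float smem[2][TILE];
    int t = threadIdx.x;

    // Prime the pipeline: start loading tile 0 before the loop body runs.
    __pipeline_memcpy_async(&smem[0][t], &in[t], sizeof(float));
    __pipeline_commit();

    for (int tile = 0; tile < n_tiles; ++tile) {
        int cur  = tile & 1;
        int next = cur ^ 1;

        // Kick off the next tile's load while we compute on the current one.
        if (tile + 1 < n_tiles) {
            __pipeline_memcpy_async(&smem[next][t],
                                    &in[(tile + 1) * TILE + t],
                                    sizeof(float));
        }
        __pipeline_commit();

        // Wait until the current tile's copy (committed one stage earlier)
        // has landed; only the newest commit may still be in flight.
        __pipeline_wait_prior(1);
        __syncthreads();

        // Placeholder "compute": real kernels issue MMA ops here.
        out[tile * TILE + t] = smem[cur][t] * 2.0f;
        __syncthreads();  // keep threads in lockstep across stages
    }
}
```

Launched as, say, `pipelined_copy_kernel<<<1, TILE>>>(in, out, n_tiles);`, the point of the double buffer is that the copy for tile `i+1` is in flight while tile `i` is being consumed, hiding global-memory latency behind compute. On Hopper, TMA and `wgmma` replace these per-thread copies with bulk tensor transfers, but the pipelining structure stays the same.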

