A Tour of NVIDIA's GPU Programming Stack — From PTX to CuTe DSL

Looking at NVIDIA’s GPU programming ecosystem from the outside in 2026, it feels increasingly chaotic: PTX, CUDA C++, CUTLASS, CuTe, Triton, CuTe DSL, CuTile — names keep showing up in papers, blogs, and PR descriptions, and new ones are still being added. But place them on a continuous ladder from hardware to humans and the whole landscape clicks into focus — the lower you go, the closer to the machine, the higher the achievable performance, the more painful to write; the higher you go, the more it feels like plain Python, the faster the iteration, but the less control you have over the hardware.

This article uses that ladder as the spine to make the relationships clear: which tool lives at which layer, how they actually relate, how real production systems (vLLM and SGLang) mix them together, and — when you actually need to write a piece of GPU code — how to choose between them.

The Abstraction Ladder — five layers from assembly to Python DSL

Let’s first sort the pile of names by abstraction level from bottom to top. The diagram below is the cheat sheet for everything that follows — each layer’s role, representative tools, and place in the ecosystem; remember this skeleton first.

[Figure: NVIDIA GPU Programming Abstraction Ladder. Lower = closer to hardware (peak performance, steep curve); higher = closer to humans (easy, fast iteration). Layers from top to bottom: Python DSL (Triton, CuTile, CuTe DSL; own compilers, lower directly to PTX); C++ templates (CUTLASS, CuTe, ThunderKittens; layout algebra and GEMM skeletons as reusable parts); CUDA C++ (.cu; SIMT with explicit grid / block / warp / thread, registers, SMEM); PTX (NVIDIA virtual ISA, the funnel for all upper layers); SASS (architecture-specific machine code, read only for profiling / RE).]
Five-layer abstraction ladder — the top Python DSL is easy to write but loses fine-grained hardware control; the bottom SASS is near-peak but virtually no one writes it by hand. The three middle layers (C++ templates, CUDA C++, PTX) trace the two-decade evolution and together form the actual terrain.

Walking down the ladder layer by layer:

Layer 5 · Python DSL (Triton / CuTile / CuTe DSL)

The highest abstraction. The shared trait: written in Python, with their own independent compilers that lower the high-level code directly to PTX — no CUDA C++ in between. This is the most important paradigm shift since 2019 — it pulled “writing GPU kernels” out of the C++ world.

  • Triton (OpenAI, 2019) — by far the dominant entry today. The core idea is block/tile-level programming rather than thread-level: you schedule in tile units, and memory coalescing, shared-memory synchronization, warp allocation are all handed off to the compiler.
  • CuTe DSL (NVIDIA, 2025) — a Python frontend over the underlying CuTe layout algebra; the goal is “the developer experience of Python with performance approaching CUTLASS C++.”
  • CuTile / CUDA Tile IR (NVIDIA, 2025) — the official tile programming model introduced in CUDA 13.1, NVIDIA’s “we’ll do Triton ourselves” response.

Layer 4 · C++ template performance layer (CUTLASS / CuTe / ThunderKittens)

The motivation here is: writing main-event operators like matmul / attention in raw CUDA C++ is too painful and too error-prone, so high-performance patterns are packaged into reusable abstractions via C++ templates.

  • CUTLASS (NVIDIA): CUDA Templates for Linear Algebra Subroutines; the name is a nod to the older closed-source library cuBLAS, and you can think of it as “an open-source, customizable BLAS.”
  • CuTe (NVIDIA, from CUTLASS 3.0): CUDA Tensors, the layout algebra inside CUTLASS that describes “how data is laid out + how threads map to data.” It’s the foundation of CUTLASS, not a parallel library.
  • ThunderKittens (Stanford Hazy Research, 2024): takes the opposite tack, asking how far a small set of “opinionated” abstractions can go. On H100 for GEMM / attention, it matches or beats CUTLASS-based FlashAttention-3 with far less code.

Layer 3 · CUDA C++ (.cu)

The foundation. CUDA C++ dates back to the first CUDA release in 2007; for years it was the only practical option, and it is still the root of the entire ecosystem. The model is SIMT: you organize parallelism explicitly across grid → block (CTA) → warp → thread, and you manage registers, shared memory, tiling, and synchronization yourself. The C++ template libraries are written on top of it, and it remains the fallback whenever a higher layer cannot express what you need.

In everyday usage “CUDA” usually refers to this language, although the word more broadly covers NVIDIA’s entire GPU computing platform (language + compiler + driver + libraries). For the vast majority of application developers, you actually touch it indirectly through precompiled libraries like cuBLAS / cuDNN rather than writing it yourself.
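To make the grid → block → warp → thread hierarchy concrete, here is a minimal pure-Python sketch of where a single thread sits. It assumes a 1D launch; the helper names (`locate_thread`, `warps_per_block`) are made up for illustration, and the only hardware fact used is the 32-thread warp.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def locate_thread(block_idx, thread_idx, block_dim):
    """Place a 1D (blockIdx.x, threadIdx.x) pair in the SIMT hierarchy."""
    return {
        "global_id": block_idx * block_dim + thread_idx,  # position in the grid
        "warp_in_block": thread_idx // WARP_SIZE,         # which warp of the block
        "lane": thread_idx % WARP_SIZE,                   # position inside that warp
    }


def warps_per_block(block_dim):
    """Number of warps needed to hold block_dim threads (last one may be partial)."""
    return (block_dim + WARP_SIZE - 1) // WARP_SIZE


info = locate_thread(block_idx=3, thread_idx=70, block_dim=256)
print(info)                  # {'global_id': 838, 'warp_in_block': 2, 'lane': 6}
print(warps_per_block(256))  # 8
```

In real CUDA C++ you never compute the warp and lane this way by hand unless you need warp-level primitives, but the arithmetic is what the hardware scheduler effectively does with your block.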

Layer 2 · PTX

Parallel Thread eXecution: NVIDIA’s virtual ISA (an intermediate representation). All upper-layer entries (CUDA C++, Triton, CuTe DSL) eventually become PTX, which ptxas then compiles down to SASS for the specific architecture, either ahead of time with the CUDA Toolkit or just in time through the driver’s JIT compiler.

Essentially no one writes a full kernel from PTX. Its real use is: inside a .cu file, drop in a few instructions via asm volatile to access hardware features the C++ layer doesn’t expose — a specific async copy, a particular cache hint, or a new instruction the compiler hasn’t caught up to yet. PTX is a patch tool, not a development language.

Layer 1 · SASS

Streaming ASSembly — the real machine code for a specific architecture. You don’t write it; you usually only look at it for extreme profiling or reverse engineering.

With these five layers laid out mentally, the ladder is in place. Two more pieces of context make it usable: how compilation funnels these upper-layer entries into PTX, and a parallel set of NVIDIA-shipped closed-source libraries that sit alongside the ladder.

Compilation paths — three frontends, one funnel into PTX

The top three layers (3, 4, 5) offer three different source-language entries, but all paths funnel into PTX, the same virtual ISA, which ptxas (or the driver’s JIT compiler at load time) then turns into SASS:

  • CUDA C++ (.cu) — including everything written with CUTLASS / CuTe templates, going through nvcc, the classical compilation chain.
  • Triton — has its own MLIR-based compiler; lowers Python directly to PTX.
  • CuTe DSL — also MLIR-based; lowers Python-written CuTe layout algebra to PTX.

Neither Triton nor CuTe DSL produces an intermediate .cu file — they are parallel source-language entries to CUDA C++, not built on top of it. This is also why Triton can coexist with cuBLAS / cuDNN and CUTLASS inside PyTorch without conflict: they all end up as the same GPU machine code; only the source differs.
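The funnel can be written down as a toy data model. This is only a restatement of the paths described above in Python; the dictionary and the helper names are illustrative, not any real tool's API.

```python
# Stage names follow the article; the structure itself is a made-up illustration.
LOWERING_PATHS = {
    "CUDA C++": ["nvcc", "PTX", "ptxas", "SASS"],
    "Triton": ["Triton compiler (MLIR)", "PTX", "ptxas", "SASS"],
    "CuTe DSL": ["CuTe DSL compiler (MLIR)", "PTX", "ptxas", "SASS"],
}


def shared_suffix(paths):
    """Return the trailing stages that every lowering path has in common."""
    common = []
    for stages in zip(*(reversed(p) for p in paths)):
        if len(set(stages)) != 1:
            break
        common.append(stages[0])
    return list(reversed(common))


def passes_through(frontend, stage):
    """True if the given frontend's path includes the given stage."""
    return stage in LOWERING_PATHS[frontend]


funnel = shared_suffix(LOWERING_PATHS.values())
print(funnel)                            # ['PTX', 'ptxas', 'SASS']
print(passes_through("Triton", "nvcc"))  # False: no .cu file in between
```

The only thing the three entries share is the tail of the list, which is the whole point: different compilers, same exit.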

[Figure: Three Frontends, One Convergence. CUDA C++ (.cu; hand-written __global__ / threadIdx, includes CUTLASS / CuTe templates) goes through nvcc (LLVM-based). Triton (Python; tile-level, PyTorch torch.compile default backend) goes through the Triton compiler (MLIR). CuTe DSL (Python frontend over CuTe layouts, used by FlashAttention 4) goes through the CuTe DSL compiler (MLIR). All three emit PTX, which ptxas turns into SASS.]
Three parallel frontend compilation paths — both Triton and CuTe DSL have their own MLIR-based compilers and lower Python directly to PTX; no .cu file is produced along the way. CUDA C++ is yet another parallel frontend, not their layover. All paths converge at PTX, and the driver compiles down to SASS.

Off the ladder — NVIDIA’s parallel closed-source libraries

The five layers above all live in the world of “source languages” — you write code, then compile it. But NVIDIA also maintains a completely different family of things you use via closed-source, precompiled, single-line API calls. The two most common:

  • cuBLAS (CUDA Basic Linear Algebra Subroutines) — general linear algebra, GEMM / BLAS.
  • cuDNN (CUDA Deep Neural Network library) — deep-learning-specific ops: convolutions, pooling, normalization, attention.

When you use PyTorch, matmul defaults to cuBLAS and convolutions default to cuDNN — these have been the foundation of PyTorch’s performance for years. Distinct from the tools on the ladder: these libraries are black boxes; you can’t see inside or change them, but as long as your parameters are right, you get NVIDIA’s years of tuning for free.

CUTLASS is exactly NVIDIA’s “open-source building-block version” of that closed-source family — the name itself nods to cuBLAS (CUDA Templates for Linear Algebra Subroutines). The functional domain overlaps (both can do GEMM) but the usage is entirely different:

[Figure: Two Camps, Comparisons Only Make Sense Within. Camp A, black box, call an API: cuBLAS (general linear algebra, GEMM / BLAS, e.g. cublasSgemm(…)) and cuDNN (deep learning ops: conv, attention, norm, e.g. cudnnConvolution…(…)); parallel to each other, split by domain; PyTorch’s default workhorse backends. Camp B, building blocks, write code: CUTLASS (C++ template-assembled high-performance GEMM / conv scaffold) on top of CuTe (layout algebra); hierarchical, with CuTe as CUTLASS’s underlying layer.]
Left: cuBLAS / cuDNN are closed-source black boxes — fill in params, call a single API; functionally split as “linear algebra vs deep learning,” parallel with each other. Right: CUTLASS / CuTe are open-source building blocks — you write code that assembles a kernel from templates; CuTe is the foundation under CUTLASS. The two camps overlap on “both can do GEMM” functionally, but the usage is entirely different: black-box, convenient, uncustomizable vs building blocks, costly, customizable.

CUTLASS / CuTe earn a layer of their own because they package not “nicer-looking C++ templates” but a body of NVIDIA’s accumulated algorithmic assets — how to avoid bank conflicts, how to use TMA, how to arrange warps, how to do pipelining — all expressed and composable through CuTe’s layout algebra and CUTLASS’s template parameters. That is precisely why the main-event operators (GEMM, attention) cannot bypass this layer when squeezing for peak. A fitting analogy: CuTe / CUTLASS are to CUDA C++ what NumPy is to hand-written C loops — they change how fast you can reach what performance level and how much you need to understand.

Code Comparison — five ways to write the same GEMM

The most direct way to feel this is to write the same task (C = A × B) five ways. The point isn’t to follow each line, but to feel how dramatically the amount of stuff you have to care about differs.

Way 1 · cuBLAS — black-box call, you just say “what to do”

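#include <cublas_v2.h>

// Create a handle, call one function, done. You don't care how it computes internally.
// cuBLAS assumes column-major storage, so the leading dimensions below are M, K, M.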
cublasHandle_t handle;
cublasCreate(&handle);

float alpha = 1.0f, beta = 0.0f;
// C = alpha * A * B + beta * C
cublasSgemm(handle,
            CUBLAS_OP_N, CUBLAS_OP_N,   // A, B not transposed
            M, N, K,                     // matrix dims
            &alpha, dA, M,               // input A
            dB, K,                       // input B
            &beta, dC, M);               // output C

cublasDestroy(handle);

What you care about: matrix dims, transpose flags, which pointer to use. That’s it. How to tile, whether to use Tensor Core, how threads are partitioned — NVIDIA decides for you, and you can’t see or change it. That’s “black-box.” cuDNN looks essentially the same, just with cudnnConvolutionForward(...) etc.
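One detail the black box does not hide is storage order: cuBLAS is column-major, while most C and PyTorch code is row-major. The usual workaround is to swap the operands, because a row-major M×N buffer is the same bytes as a column-major N×M transpose, and Cᵀ = Bᵀ·Aᵀ. The sketch below checks that identity in pure Python. `gemm_colmajor` is a hypothetical stand-in for a no-transpose, alpha = 1, beta = 0 GEMM, not the cuBLAS API.

```python
def gemm_colmajor(m, n, k, A, lda, B, ldb, C, ldc):
    """Reference column-major GEMM, C = A * B, on flat lists.

    Element (i, j) of a column-major matrix with leading dimension ld
    lives at flat index i + j * ld.
    """
    for j in range(n):
        for i in range(m):
            acc = 0.0
            for p in range(k):
                acc += A[i + p * lda] * B[p + j * ldb]
            C[i + j * ldc] = acc


M, N, K = 2, 3, 4
# Row-major inputs, as a typical host program would hold them.
A = [float(i * K + p) for i in range(M) for p in range(K)]      # M x K
B = [float(p * N + j + 1) for p in range(K) for j in range(N)]  # K x N
C = [0.0] * (M * N)

# Swap operands and swap M/N; the leading dimensions become the row lengths.
gemm_colmajor(N, M, K, B, N, A, K, C, N)

# Direct row-major triple loop for comparison.
expected = [sum(A[i * K + p] * B[p * N + j] for p in range(K))
            for i in range(M) for j in range(N)]
print(C == expected)  # True
```

With the real library the same trick means passing the B pointer first, then A, with dimensions (N, M, K); no data is copied or transposed.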

Way 2 · Plain CUDA C++ — you write it yourself, but naively

// Each thread computes one element of C.
__global__ void gemm_naive(float* A, float* B, float* C, int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < K; ++k)        // write the dot-product loop yourself
            sum += A[row * K + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}
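// Launch with a 2D grid that covers all of C (ceil-divide so partial blocks are included):
//   dim3 block(16, 16);
//   dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
//   gemm_naive<<<grid, block>>>(dA, dB, dC, M, N, K);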

What you care about: how threads map to elements, how to write the loop. Note: this runs, but it’s terribly slow — no shared memory, no Tensor Core, no memory-access optimization. To make it fast, you’d manually add tiling, SMEM staging, bank-conflict handling — hundreds to thousands of lines, and you rewrite it for every new architecture. That’s why CUTLASS / CuTe exist.
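If the thread-to-element mapping feels abstract, it can help to run the launch serially. The following is a rough pure-Python emulation of the naive kernel, assuming row-major inputs and a small 4×4 block; the function name and the `idle` counter are additions for illustration. It shows why the grid is rounded up and why the bounds check is required.

```python
def ceil_div(a, b):
    return (a + b - 1) // b


def launch_gemm_naive(A, B, M, N, K, block=(4, 4)):
    """Serially emulate the naive kernel: one simulated thread per element of C."""
    bx, by = block
    grid_x, grid_y = ceil_div(N, bx), ceil_div(M, by)
    C = [0.0] * (M * N)
    idle = 0
    for block_y in range(grid_y):
        for block_x in range(grid_x):
            for ty in range(by):
                for tx in range(bx):
                    row = block_y * by + ty
                    col = block_x * bx + tx
                    if row < M and col < N:        # same guard as the kernel
                        acc = 0.0
                        for k in range(K):
                            acc += A[row * K + k] * B[k * N + col]
                        C[row * N + col] = acc
                    else:
                        idle += 1                  # launched, but has no element
    return C, idle


M, N, K = 5, 6, 3
A = [float(i) for i in range(M * K)]
B = [float(i % 4) for i in range(K * N)]
C, idle = launch_gemm_naive(A, B, M, N, K)
print(idle)  # 34: a 2x2 grid of 16-thread blocks launches 64 threads for 30 elements
```

Without the `row < M && col < N` guard, those 34 extra threads would read and write out of bounds.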

Way 3 · CuTe — describe “how data is laid out + who computes what” via layout algebra

CuTe’s essence: it gives you a vocabulary for declaratively describing data layout and thread mapping instead of hand-writing a pile of index math. Core concepts: Layout (shape + stride) and Tensor (data + Layout).

using namespace cute;

// Wrap raw memory as a layout-bearing Tensor.
// make_shape(M,K) is the shape; make_stride describes how it's laid out in memory.
Tensor mA = make_tensor(make_gmem_ptr(A), make_shape(M, K), make_stride(_1{}, M));
Tensor mB = make_tensor(make_gmem_ptr(B), make_shape(N, K), make_stride(_1{}, N));
Tensor mC = make_tensor(make_gmem_ptr(C), make_shape(M, N), make_stride(_1{}, M));

// Declare how to tile: each block handles a 128×128×8 slice.
auto block_tile = make_shape(Int<128>{}, Int<128>{}, Int<8>{});

// Use layouts to "slice" the global matrix into the chunk this block owns.
Tensor gA = local_tile(mA, block_tile, ...);
Tensor gB = local_tile(mB, block_tile, ...);
Tensor gC = local_tile(mC, block_tile, ...);

// Declare a "tiled MMA": which Tensor Core instruction, how warps are arranged.
TiledMMA mma = make_tiled_mma(SM80_16x8x8_F32F16F16F32_TN{}, ...);

// Then do multiply-accumulate over the layout-described tiles.
cute::gemm(mma, gA, gB, gC);

See the difference? No hand-written A[row*K+k] indices; instead, make_shape / make_stride / local_tile declare “what the data looks like and how it’s sliced”, and cute::gemm follows that declaration. That’s layout algebra — it manages data orchestration, not “call a stock function for you.”
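The core idea of a layout, a function from logical coordinates to a memory offset defined by shape and stride, fits in a few lines of Python. This is a simplified 2D sketch of the concept, not CuTe's implementation; the `Layout` class and `tile_offset` helper are hypothetical and ignore CuTe's nested shapes and compile-time integers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """A 2D layout in the CuTe sense: a shape plus one stride per mode."""
    shape: tuple
    stride: tuple

    def __call__(self, i, j):
        # The layout maps a logical coordinate to a memory offset.
        assert 0 <= i < self.shape[0] and 0 <= j < self.shape[1]
        return i * self.stride[0] + j * self.stride[1]


def tile_offset(layout, tile_shape, tile_coord):
    """Offset of the first element of tile (ti, tj), loosely like local_tile."""
    ti, tj = tile_coord
    return layout(ti * tile_shape[0], tj * tile_shape[1])


M, K = 4, 6
col_major = Layout(shape=(M, K), stride=(1, M))  # same idea as make_stride(_1{}, M)
row_major = Layout(shape=(M, K), stride=(K, 1))

print(col_major(2, 3))                         # 14
print(row_major(2, 3))                         # 15
print(tile_offset(col_major, (2, 2), (1, 2)))  # 18
```

The same logical coordinate lands at different offsets depending only on the stride, which is why code written against layouts does not need to change when the storage order does.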

Way 4 · CUTLASS — assemble a full kernel by filling template parameters

CUTLASS sits on top of CuTe and lets you assemble a production-grade kernel by filling in template parameters, without building from CuTe atoms yourself.

using namespace cutlass::gemm;

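// Schematic, not copy-paste: it blends CUTLASS 2.x-style template arguments with a Hopper target.
// Real CUTLASS 3.x Hopper kernels are assembled with CollectiveBuilder and GemmUniversalAdapter.
// The idea is the same: don't write a loop, "declare" what each dimension of this GEMM should use: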
using Gemm = device::GemmUniversal<
    cutlass::half_t, cutlass::layout::RowMajor,    // A: dtype + layout
    cutlass::half_t, cutlass::layout::ColumnMajor, // B
    float,           cutlass::layout::RowMajor,    // C
    float,                                         // accumulate in float
    cutlass::arch::OpClassTensorOp,                // use Tensor Core
    cutlass::arch::Sm90,                           // target arch: Hopper
    Shape<_128,_128,_64>,                          // block tile size
    Shape<_64, _64, _64>                           // warp tile size
    /* plus epilogue, pipeline stages, and a pile of other knobs */ >;

Gemm gemm_op;
gemm_op({M, N, K}, {dA, lda}, {dB, ldb}, {dC, ldc}, {alpha, beta}); // launch

What you care about: dtype, layout, target arch, tile size, warp partitioning, Tensor Core or not — all performance knobs, you set them. CUTLASS compiles those choices into a highly optimized kernel. A minimal Hopper WGMMA + TMA GEMM example is about 100 lines.
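The tile-size knobs are not free-floating numbers; they imply a warp count and a shared-memory footprint. Here is a back-of-the-envelope sketch using the tile sizes from the snippet above. The formula is a simplification (one A tile plus one B tile per pipeline stage, no padding or epilogue storage), and `tile_budget` is a hypothetical helper, not a CUTLASS function.

```python
def tile_budget(block_tile, warp_tile, bytes_per_element=2, warp_size=32):
    """Derive rough launch facts from a block tile and warp tile, both (M, N, K)."""
    bm, bn, bk = block_tile
    wm, wn, wk = warp_tile
    if bm % wm or bn % wn or bk % wk:
        raise ValueError("warp tile must evenly divide the block tile")
    warps = (bm // wm) * (bn // wn) * (bk // wk)
    # One pipeline stage holds an A tile (bm x bk) and a B tile (bn x bk).
    smem_per_stage = (bm * bk + bn * bk) * bytes_per_element
    return {
        "warps": warps,
        "threads": warps * warp_size,
        "smem_bytes_per_stage": smem_per_stage,
    }


budget = tile_budget((128, 128, 64), (64, 64, 64))  # half precision: 2 bytes
print(budget)
# {'warps': 4, 'threads': 128, 'smem_bytes_per_stage': 32768}
```

Under these assumptions each extra pipeline stage costs another 32 KiB of shared memory, which is the kind of trade-off the template parameters make you own.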

Way 5 · Triton — Python tile-level; hide the hardware details

import triton
import triton.language as tl

@triton.jit
def gemm_kernel(A, B, C, M, N, K,
                stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
                BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # This block owns a [BLOCK_M, BLOCK_N] slice of C.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    # Index vectors inside the tile (not threads — a slice of elements per block).
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)

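    # Iterate along K, load a BLOCK_K-wide tile each step, do tile-level matmul.
    # Masks keep edge tiles in bounds when M, N, K are not multiples of the block sizes.
    for k in range(0, K, BLOCK_K):
        a_mask = (offs_m[:, None] < M) & (k + offs_k[None, :] < K)
        b_mask = (k + offs_k[:, None] < K) & (offs_n[None, :] < N)
        a = tl.load(A + offs_m[:, None] * stride_am + (k + offs_k[None, :]) * stride_ak, mask=a_mask, other=0.0)
        b = tl.load(B + (k + offs_k[:, None]) * stride_bk + offs_n[None, :] * stride_bn, mask=b_mask, other=0.0)
        acc += tl.dot(a, b)   # compiler auto-lowers to Tensor Core MMA

    c_mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(C + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn, acc, mask=c_mask)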

The key insight: the whole code has no threadIdx, no __shared__, no synchronization primitives. You think in tiles — load a tile, do a tile-level matmul, accumulate — and the Triton compiler decides automatically: how threads map, how shared memory is staged, which Tensor Core instructions to emit, how memory coalescing is done. This is its biggest paradigm difference from CUDA C++.
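To see what "thinking in tiles" means without a GPU, the tile loop can be emulated in plain Python. This is a teaching sketch, not how Triton executes: each (pid_m, pid_n) pair plays the role of one program instance, the list comprehensions stand in for tile loads, and out-of-range elements are masked to zero so ragged edges work. The function name and the tiny block sizes are arbitrary choices.

```python
def tiled_matmul(A, B, M, N, K, BLOCK_M=2, BLOCK_N=2, BLOCK_K=2):
    """Pure-Python emulation of a tile-level GEMM; A, B, C are row-major lists."""
    C = [0.0] * (M * N)
    for pid_m in range((M + BLOCK_M - 1) // BLOCK_M):      # one "program" per C tile
        for pid_n in range((N + BLOCK_N - 1) // BLOCK_N):
            rows = [pid_m * BLOCK_M + i for i in range(BLOCK_M)]
            cols = [pid_n * BLOCK_N + j for j in range(BLOCK_N)]
            acc = [[0.0] * BLOCK_N for _ in range(BLOCK_M)]
            for k0 in range(0, K, BLOCK_K):
                ks = [k0 + p for p in range(BLOCK_K)]
                # Masked tile loads: out-of-range elements read as 0.0.
                a = [[A[r * K + k] if r < M and k < K else 0.0 for k in ks] for r in rows]
                b = [[B[k * N + c] if k < K and c < N else 0.0 for c in cols] for k in ks]
                # Tile-level dot product, accumulated into acc.
                for i in range(BLOCK_M):
                    for j in range(BLOCK_N):
                        acc[i][j] += sum(a[i][p] * b[p][j] for p in range(BLOCK_K))
            # Masked tile store.
            for i, r in enumerate(rows):
                for j, c in enumerate(cols):
                    if r < M and c < N:
                        C[r * N + c] = acc[i][j]
    return C


M, N, K = 3, 5, 3  # deliberately not multiples of the block sizes
A = [float(i + 1) for i in range(M * K)]
B = [float(i % 3) for i in range(K * N)]
C = tiled_matmul(A, B, M, N, K)
```

Nothing in the function mentions threads. Everything inside one (pid_m, pid_n) iteration is what the compiler is free to spread across warps, shared memory, and Tensor Core instructions.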

Putting all five side by side:

| Style | What you write at the core | What you need to understand | One-liner |
|---|---|---|---|
| cuBLAS / cuDNN | Fill params, call 1 function | Matrix dims | Order from the menu, never enter the kitchen |
| Plain CUDA C++ | Hand-written thread indices + loops | Thread model (and everything else, for speed) | Start from flour, but with the dumbest recipe |
| CuTe | Layout algebra to declare data layout + tile + MMA | Data layout, Tensor Core instructions, warp mapping | A kitchen with proper professional tools |
| CUTLASS | Fill template params to assemble a kernel | Same as above, but with a pre-built scaffold | A high-performance meal kit you just season |
| Triton | Python tile-level, no threadIdx / SMEM sync | Tile size, KV access patterns | Let the compiler lower to thread level for you |

The key intuition: with cuBLAS / cuDNN you say “I want a matmul,” and what’s inside is none of your business; with CuTe / CUTLASS / Triton you describe how the kernel should work internally, just using their high-level vocabulary (layout algebra / template params / tile abstractions) instead of writing raw indices.

Real Cases — FlashAttention and vLLM / SGLang

Two real projects to walk the ladder. FlashAttention shows the “vertical” evolution — the same operator descending and partially re-ascending the ladder as hardware generations roll over, each generation perched on “the highest-performance layer at the time.” vLLM and SGLang show the “horizontal” mix — within a single moment and a single system, different operators are placed on different layers, main operators routed to the best library and gaps filled by Triton. Put the two together and you essentially have the actual shape of industrial GPU programming today.

FlashAttention — a living fossil of NVIDIA’s programming evolution

If we have to pick one project to understand how this whole ecosystem evolved, it has to be FlashAttention. Nearly every generation targets newer hardware and switches programming paradigm, so the project has walked through every major path in the stack. Read it as a “living fossil.”

[Figure: FlashAttention across four generations, same operator, four programming paradigms. FA1 (2022, A100): hand-written CUDA C++ with a Triton backend; first to propose tiling + online softmax with SRAM-aware scheduling. FA2 (2023, A100): rewritten from scratch on CUTLASS 3.x / CuTe, about 2× faster than FA1. FA3 (2024, H100): CUTLASS tuned deeply for Hopper with WGMMA + TMA + setmaxnreg, FP16 reaching about 740 TFLOPS. FA4 (2025-2026, H100 + B200): CuTe DSL in Python, performance approaching C++ and about 50% faster than the Triton version.]
The same operator travels through four programming paradigms — from hand-written CUDA C++ (FA1) to the C++ templates of CUTLASS / CuTe (FA2, FA3) and then to the Python frontend CuTe DSL (FA4). Each generation stands on “the highest-performance layer at the time” — this migration is the best single snapshot of how NVIDIA’s programming ecosystem has evolved.

The FA4 step deserves a callout: writing Python while still hitting C++-level performance — that is exactly the design goal of CuTe DSL, giving “Python-ecosystem main-event operators + bleeding-edge hardware + must extract 95%+” a path that doesn’t require falling back to C++. It and Triton fill different niches inside the Python world: Triton hides the hardware details (layout / TMA / WGMMA), so you don’t need to understand them to write code, with the trade-off that performance caps at 80-95%; CuTe DSL exposes those details, requiring you to understand them, in exchange for 95%+ control. One serves daily custom operators, the other the top 5-15% peak-extraction scenarios. FA4 picked the latter because attention on Hopper / Blackwell has to be pushed to peak — Triton wasn’t enough, and pure C++ templates were too costly.

vLLM and SGLang — the whole ladder, simultaneously, inside one system

vLLM and SGLang’s core strategy is to be a dispatch layer: hand work to a bunch of best-of-breed backends, and only fill in their own kernels in the gaps where no stock library reaches. They’re more like conductors than performers.

[Figure: vLLM / SGLang’s multi-backend call stack, conductor rather than performer. App layer (inference engine): vLLM with a pluggable attention backend and auto-selection; SGLang selecting by hardware (H100 → FA3, B200 → TRTLLM, Triton fallback). Router layer: FlashInfer, a unified API routing to FA2/3, cuDNN, CUTLASS, TRT-LLM. Main kernel layer (external libraries): FlashAttention (FA2 / 3 / 4), CUTLASS / CuTe (GEMM / MoE), CuTe DSL (Blackwell GEMM), cuBLAS / cuDNN, TRT-LLM (FP8, Blackwell). Self-written gap layer: Triton kernels for PagedAttention, KV cache fetch, norm / quant fusions, and fallback. Everything ends in PTX → SASS on the GPU.]
vLLM and SGLang aren’t “rewriters” of any single layer; they’re the smartest integrators on the ladder — top layer dispatches to the best backend, FlashInfer routes once more, the actual heavy operators are outsourced to FlashAttention / CUTLASS / cuDNN / TRT-LLM, and Triton is reserved for “what nothing else covers.”

vLLM’s own README kernel list says it all: optimized attention kernels include FlashAttention, FlashInfer, TRTLLM-GEN, FlashMLA, and Triton; optimized GEMM / MoE kernels use CUTLASS, TRTLLM-GEN, CuTeDSL; plus torch.compile for automatic kernel generation and graph-level transforms. At least six or seven different kernel sources in one sentence.

SGLang’s strategy is essentially the same; its built-in attention backends are even more varied, with MLA (DeepSeek-style attention) alone offering FlashInfer MLA, FlashMLA, Cutlass MLA, TRTLLM MLA each with a different page_size. The auto-select logic is telling too — Hopper defaults to fa3, Blackwell defaults to trtllm_mha, other architectures default to flashinfer with triton as fallback. The fallback chain spells out the policy: specialized libraries first → general libraries next → Triton as the safety net.
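The fallback policy reads naturally as a short chain lookup. The sketch below is a hedged illustration of that policy, not SGLang's actual code: the function name, the table, and the `available` argument are invented here. Trying flashinfer as a second choice on Hopper and Blackwell is an assumption that goes beyond what the text states.

```python
# Architecture defaults as described in the text; everything else is hypothetical.
DEFAULT_BY_ARCH = {"hopper": "fa3", "blackwell": "trtllm_mha"}


def pick_attention_backend(arch, available):
    """Return the first usable backend: specialized, then general, then Triton."""
    chain = []
    if arch in DEFAULT_BY_ARCH:
        chain.append(DEFAULT_BY_ARCH[arch])   # specialized library first
    chain += ["flashinfer", "triton"]         # general library, then the safety net
    for name in chain:
        if name in available:
            return name
    raise RuntimeError(f"no attention backend available for {arch!r}")


print(pick_attention_backend("hopper", {"fa3", "flashinfer", "triton"}))  # fa3
print(pick_attention_backend("blackwell", {"flashinfer", "triton"}))      # flashinfer
print(pick_attention_backend("ampere", {"triton"}))                       # triton
```

Keeping Triton at the end of every chain is what guarantees that something runs on any hardware, even if it is not the fastest option.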

What did these two engines actually write themselves? Three categories:

  1. Dispatch / abstraction layer (their real proprietary value) — a pluggable attention/GEMM backend abstraction + the runtime logic that auto-picks the best one. That’s the real engineering moat, not the kernels themselves.
  2. Self-written Triton kernels — concentrated in two places: (a) operators that wrap their unique data structures (PagedAttention and KV cache fetch for paged KV cache), which no external library will write for you; and (b) cross-hardware fallbacks that guarantee it runs anywhere.
  3. Main high-performance operators — almost entirely outsourced to FlashAttention, FlashInfer, CUTLASS, CuTe DSL, TRT-LLM; they never rebuild these themselves.

This strongly confirms the earlier mental model: main-event operators (attention peak, GEMM) → CUTLASS / CuTe and specialized libraries, because peak performance demands that level of hardware control, whether it is written in C++ or through CuTe DSL; peripheral / adapter / fallback operators → Triton (write fast in Python, good enough, portable). Neither top-tier production system uses a single tool; both mix tools across layers, by operator importance and scenario. That’s what real-world GPU programming looks like: not picking one path, but using the entire ladder simultaneously inside one system.

Decision Map — how to pick a technical path

Compress everything into one decision map you can use. A single question chains the layers — at each level, ask: “Did the layer above really fall short?” Only descend if it did, because every step down compounds development cost and required hardware knowledge.

[Figure: GPU programming technical-path decision map. Default upward, driven by measurements, mix per operator. Step 0 (90% of people stop here): use stock libraries, meaning PyTorch + cuBLAS / cuDNN / FlashAttention, or torch.compile, which auto-generates Triton under the hood. Step 1, when you need a custom op beyond stock, split by ecosystem: Python → Triton (tile-level, 80-95% perf, the gap-filler in vLLM / SGLang); C++ → libraries first (cuBLAS / cuDNN), otherwise go to Step 2. Step 2, when a main operator measures short: C++ → CUTLASS / CuTe (where FA2 / FA3 / xFormers / TRT-LLM live); Python → CuTe DSL (where FA4 lives; not a Triton replacement). Step 3, when even CUTLASS can’t cover your fusion: write a raw CUDA C++ kernel. Step 4, a patch rather than a tier: inline a few PTX instructions (asm volatile) anywhere on the way down.]
One main line: default to Step 0 (stock libraries) and descend only when forced. Step 1 splits by ecosystem (Python vs C++); Step 2 drops to the C++ performance layer (CUTLASS / CuTe) or its Python frontend (CuTe DSL) when measured performance falls short; Step 3 (raw CUDA C++) is reserved for the rare special fusion; Step 4’s PTX is a patch you embed inside any of the above, never an independent tier.
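The map is simple enough to write as a function. This is one possible reading of it in Python, with made-up parameter names; PTX is left out on purpose because the map treats it as a patch you add inside whichever layer you land on.

```python
def choose_layer(ecosystem, stock_covers_it, measured_short=False, beyond_cutlass=False):
    """Walk the decision map top-down and return the first layer that suffices.

    ecosystem:       "python" or "cpp"
    stock_covers_it: a stock library (or torch.compile) already does the job
    measured_short:  a main operator was measured and the current layer falls short
    beyond_cutlass:  the fusion you need is something CUTLASS cannot express
    """
    if ecosystem not in ("python", "cpp"):
        raise ValueError("ecosystem must be 'python' or 'cpp'")
    if stock_covers_it:
        return "stock library"       # Step 0
    if ecosystem == "python" and not measured_short:
        return "Triton"              # Step 1, Python side
    if beyond_cutlass:
        return "raw CUDA C++"        # Step 3
    if ecosystem == "python":
        return "CuTe DSL"            # Step 2, Python side
    return "CUTLASS / CuTe"          # Step 2, C++ side (Step 1 libraries already ruled out)


print(choose_layer("python", stock_covers_it=True))                        # stock library
print(choose_layer("python", stock_covers_it=False))                       # Triton
print(choose_layer("python", stock_covers_it=False, measured_short=True))  # CuTe DSL
print(choose_layer("cpp", stock_covers_it=False))                          # CUTLASS / CuTe
```

Note that every branch below Step 0 depends on a measurement or a concrete gap, never on a hunch that a lower layer "should be faster."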

Three rules to actually use this map:

  1. Default upward, not downward. Always try the most convenient layer first; only descend a notch after measurement shows it isn’t enough. Don’t write CUTLASS just because you heard it’s fast — most of the time Triton, or even a stock library, suffices. Premature optimization wastes lives.
  2. It’s not “pick one path” — it’s “mix inside one system.” This is the biggest lesson from vLLM / SGLang: main operators (attention / GEMM) use specialized libraries and CUTLASS, peripheral and adapter operators use Triton, the fallback is also Triton. Real systems use the entire ladder simultaneously; the goal is to put each operator at its correct layer.
  3. Distinguish “call” vs “write,” “black box” vs “building blocks.” cuBLAS / cuDNN are black boxes (API calls, uncustomizable); CUTLASS / CuTe / Triton are building blocks (write code, customizable). FlashAttention and the like are finished products others built with the blocks and that you in turn use as a black box. Be clear about whether you want “use what exists” or “build my own,” and the tool choice stops being confusing.

Closing — one-line summary

NVIDIA’s GPU programming landscape looks like a pile of names — PTX, CUDA C++, CUTLASS, CuTe, Triton, CuTe DSL, CuTile… But fundamentally it is a continuous ladder from hardware to humans: the bottom (PTX / SASS) is the exit, the middle (CUDA C++ + CUTLASS / CuTe) is the foundation and the extreme-performance tool, the top (Triton / CuTe DSL) is the Python-era fast entry. All paths converge on the same exit (PTX → SASS); the only difference is which layer you enter from.

Use stock libraries if you can; if you must write a kernel, default to Triton in Python, libraries-then-CUTLASS in C++; for extreme performance use CUTLASS / CuTe (or its Python frontend CuTe DSL); PTX is just a patch wherever a patch is needed.

Descend layer by layer, driven by measurement, mixing by operator — once you load this ladder into your mental model, you can read any project’s kernel directory structure and see exactly what they put at each layer and why.

References — official docs · representative projects · community discussion

NVIDIA official documentation

  • CUDA Toolkit: CUDA documentation, PTX ISA reference, CUDA C++ Programming Guide
  • CUTLASS: NVIDIA/cutlass on GitHub, CUTLASS 3.x docs, CuTe quick-start guide
  • CUDA Tile IR / CuTile: CUDA 13.1 release notes, the CuTile programming model intro

Triton and PyTorch integration

  • Triton project: triton-lang/triton on GitHub, Triton programming model paper (Tillet et al. 2019)
  • TorchInductor: torch.compile docs, design notes for Triton as PyTorch 2.x’s default codegen backend
  • Gluon: the “one level lower” DSL released by the Triton team, exposing tile layout / memory allocation and other low-level details

FlashAttention evolution

  • FlashAttention 1 — Dao et al. (2022), “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness”
  • FlashAttention 2 — Dao (2023), “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning”
  • FlashAttention 3 — Shah et al. (2024), “FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision”; uses Hopper WGMMA + TMA, reaches ~740 TFLOPS
  • FlashAttention 4 — the CuTe DSL implementation in Dao-AILab/flash-attention, covering both SM90 and SM100

Representative projects

Industry overviews and blogs

  • 35 Modern GPU Kernel Frameworks — a survey blog that lays out the lineage of GPU kernel frameworks; the evolution arc CUDA C++ → Triton → CuTe → ThunderKittens → CuTile / CuTe DSL is drawn out clearly
  • Tri Dao’s blog and talks — the FlashAttention author’s first-person account of “why I switched tools”
  • GPU MODE community — the Discord community with the densest high-quality Triton / CUTLASS / CuTe discussion