KAIO — Writing GPU Kernels in Rust Without CUDA
If you have ever tried writing custom ML inference in Rust (using Candle, Burn, or your own stack), you eventually hit a wall. You want more control, but the only way to get it is not in Rust. You drop into CUDA C++, write FFI bindings, deal with platform issues, pray to the WSL2 gods, and maintain two build systems. Or you don’t write the kernel at all.
Python has an answer for this: triton.jit.
Rust didn’t. We have made progress as a community, but there is a real gap between what Python devs can ship in a weekend and what stays blocked in Rust waiting on the few devs who can write both CUDA and Rust fluently.
So I decided to try to fill that gap.
What is KAIO?
Think of it as Triton-lite, but in Rust. You write a Rust function with #[gpu_kernel]. It compiles to PTX at build time. It runs through the NVIDIA driver at runtime.
No CUDA toolkit. No Python. No C++ build step. Works on Windows and Linux.
Here’s a real kernel. Fused SiLU gate (used in LLaMA, Mistral, Qwen):
```rust
use kaio::prelude::*;

#[gpu_kernel(block_size = 256)]
fn fused_silu_gate(x: &[f32], gate: &[f32], out: &mut [f32], n: u32) {
    let idx = thread_idx_x() + block_idx_x() * block_dim_x();
    if idx < n {
        let xi = x[idx];
        let sig = 1.0f32 / (1.0f32 + exp(-xi));
        out[idx] = xi * sig * gate[idx];
    }
}
```
That’s it, that’s the entire kernel. The macro handles PTX generation, register allocation, shared memory declarations, and launch configuration. Every kernel is validated with ptxas --verify.
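When validating a kernel like this, a CPU reference is handy for element-wise comparison against the GPU output. Here is a minimal sketch in plain Rust (my own reference implementation of the same math, not code from the KAIO repo):

```rust
/// CPU reference for the fused SiLU gate: out[i] = silu(x[i]) * gate[i],
/// where silu(v) = v * sigmoid(v). Useful for checking GPU results.
fn fused_silu_gate_ref(x: &[f32], gate: &[f32]) -> Vec<f32> {
    x.iter()
        .zip(gate)
        .map(|(&xi, &gi)| {
            let sig = 1.0f32 / (1.0f32 + (-xi).exp());
            xi * sig * gi
        })
        .collect()
}

fn main() {
    let x = vec![-1.0f32, 0.0, 1.0, 2.0];
    let gate = vec![1.0f32; 4];
    let out = fused_silu_gate_ref(&x, &gate);
    // silu(0) = 0, silu(1) ≈ 0.7311
    assert!(out[1].abs() < 1e-6);
    assert!((out[2] - 0.731058).abs() < 1e-4);
    println!("{out:?}");
}
```

Comparing against a reference like this is how "landing at 1e-7" claims get made honest.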
Performance
Let’s talk about what KAIO can do today, not what it can be.
v0.2.0 ships with tensor-core matmul hitting:
92.5% of cuBLAS sgemm at 4096² (RTX 4090)
Important context: KAIO’s tensor-core path uses fp16 inputs with fp32 accumulation via mma.sync, while cuBLAS sgemm is f32→f32. This is not a precision-equivalent comparison. It’s a practical baseline for tracking progress.
There are two paths: a synchronous tile loop and an async cp.async pipeline. The async variant wins at large sizes because it overlaps global loads with math. The progression across Phase 6 on the async path:
| Phase | % of cuBLAS sgemm (4096²) |
|---|---|
| Phase 4 | 31% |
| Sprint 6.7 | 85.1% |
| Sprint 6.7b | 92.5% |
Small matrices are slow; the kernel needs enough work to fill the GPU.
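A back-of-the-envelope count shows why. Assuming one thread block computes a 128×128 output tile (that tile size is illustrative, not necessarily KAIO's actual configuration), an RTX 4090 has 128 SMs to keep busy:

```rust
// Rough occupancy check: how many thread blocks does a tiled matmul launch?
// The 128x128 tile size is an illustrative assumption, not KAIO's config.
fn blocks_for(n: u32, tile: u32) -> u32 {
    let tiles_per_dim = (n + tile - 1) / tile; // ceiling division
    tiles_per_dim * tiles_per_dim
}

fn main() {
    let sms = 128; // RTX 4090 streaming multiprocessors
    for n in [256u32, 1024, 4096] {
        println!("n = {:4}: {:4} blocks vs {} SMs", n, blocks_for(n, 128), sms);
    }
    assert_eq!(blocks_for(256, 128), 4); // far too few blocks to fill the GPU
    assert_eq!(blocks_for(4096, 128), 1024); // plenty of parallel work
}
```

At 256², four blocks cannot occupy 128 SMs no matter how good the kernel is; at 4096² there are a thousand blocks and the math dominates.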
When would you use KAIO?
You reach for it when:
- Your framework can’t express the kernel you need
- You don’t want to drop into CUDA C++
- You want to stay entirely in Rust
The Part That Actually Mattered
My first benchmarks hit 31% of cuBLAS. That was fine. Correctness came first, and KAIO's matmul was matching the reference to within 1e-7. Fast-and-wrong is a very bad thing.
I assumed vectorized loads (LDG.128) would be my big win, hoping to push into 60%. I found something more interesting before I got there.
The real bottleneck was shared-memory bank conflicts. Fixing that alone gave +2.4pp on the sync path and +7.4pp on async. That is what pushed async to 92.5%.
What was happening: shared tile rows all started at bank 0. Fragment loads caused 8-way serialization across the warp. Adding 8 bytes of padding to the row stride broke the alignment. Conflicts disappeared. The async pipeline was already hiding global memory latency well; it just couldn’t breathe at fragment-read time. Remove the contention, and performance jumped.
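The fix is easy to sanity-check on paper. Shared memory on NVIDIA GPUs has 32 banks, each 4 bytes wide, so the bank of a byte address is (addr / 4) % 32. A sketch (the 256-byte row stride, i.e. 128 f16 elements per row, is an illustrative assumption) of how an 8-byte pad spreads row starts across banks:

```rust
// NVIDIA shared memory has 32 banks, each 4 bytes wide:
// bank(addr) = (addr / 4) % 32. If every tile row starts in bank 0,
// fragment loads that touch different rows serialize across the warp.
fn bank_of(byte_addr: u32) -> u32 {
    (byte_addr / 4) % 32
}

fn row_start_banks(row_stride_bytes: u32, rows: u32) -> Vec<u32> {
    (0..rows).map(|r| bank_of(r * row_stride_bytes)).collect()
}

fn main() {
    // Illustrative tile: 256-byte rows (128 f16 elements), 8 rows.
    let unpadded = row_start_banks(256, 8);
    let padded = row_start_banks(256 + 8, 8); // +8 bytes of padding per row
    println!("unpadded row-start banks: {unpadded:?}"); // all bank 0
    println!("padded   row-start banks: {padded:?}");   // spread out
    assert!(unpadded.iter().all(|&b| b == 0)); // every row collides
    assert!(padded.windows(2).all(|w| w[0] != w[1])); // neighbors differ
}
```

With a 256-byte stride every row start maps to bank 0; with 264 bytes the starts walk through banks 0, 2, 4, 6, … and the serialization disappears.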
Then came the interesting failure. I tried flipping Tile B to row-major (required for clean LDG.128 cooperative loads). Mid-implementation, I caught a structural problem.
mma.sync fragment-B expects each b32 register to pack two consecutive rows of the same column. In col-major shared, those rows are 2 bytes apart; one load grabs both. In row-major shared, they are row_stride bytes apart and can’t be packed in one load. Instruction count went from 3072 to 3712. The “optimization” was a regression.
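The offset arithmetic behind that claim fits in a few lines (the 16×16 tile dimensions here are illustrative, not the kernel's actual tile shape):

```rust
// Byte offset of an f16 element (row, col) in a shared-memory tile.
const F16_BYTES: u32 = 2;

fn col_major_offset(row: u32, col: u32, rows: u32) -> u32 {
    (col * rows + row) * F16_BYTES
}

fn row_major_offset(row: u32, col: u32, cols: u32) -> u32 {
    (row * cols + col) * F16_BYTES
}

fn main() {
    // Illustrative 16x16 tile. mma.sync's B fragment packs rows r and r+1
    // of the same column into one 32-bit register.
    let (rows, cols) = (16u32, 16u32);
    let (r, c) = (4u32, 7u32);

    // Col-major: consecutive rows of a column are adjacent (2 bytes apart),
    // so a single 4-byte load fills both halves of the b32 register.
    let gap_cm = col_major_offset(r + 1, c, rows) - col_major_offset(r, c, rows);
    assert_eq!(gap_cm, 2);

    // Row-major: the same two elements sit a full row stride apart,
    // so packing them costs an extra load per fragment.
    let gap_rm = row_major_offset(r + 1, c, cols) - row_major_offset(r, c, cols);
    assert_eq!(gap_rm, cols * F16_BYTES); // 32 bytes in this sketch
    println!("col-major gap: {gap_cm} bytes, row-major gap: {gap_rm} bytes");
}
```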
I reverted. Shipped without LDG.128. Because the data said so.
Three rules that mattered more than any optimization idea: measure every step, commit intermediate states, and define ship-vs-revert thresholds before you start.
Backstory
I started writing code well over a decade ago, beginning with PHP and moving into Python. Python is amazing. It lets you iterate at the speed of your own thought, as fast as you can hammer keys, and the ecosystem is full and mature.
When I first tried Rust, it still felt brand new. A guy in my office told me to check out this new language; he said it was type-safe and you didn't have to worry about pointers and footguns as much as in C. I started reading the docs and fell in love. Everything clicked. Out of the box, Rust works the way my brain always wanted low-level programming to feel. (C still has a place in my heart, don't get me wrong.)

Fast forward to the AI/ML wave. ML is not novel anymore; it is in every product, including products that don't need it. I decided I wanted to write ML and AI in Rust. The iteration cycles would be slower, but for production it makes sense: you get the performance and the safety in the same language.
That path led me to the wall I opened with. It is also what led to writing KAIO over one of those weekends where you can't sleep, stare at a problem you wish a tool existed for, and somewhere around the fourth cup of coffee and too much nicotine say fuck it and start writing code.
A Note on AI-Assisted Development
I use AI heavily in my workflow, but not the way a lot of devs do. I treat it as a force multiplier, not a thinking replacement. I read every line of code that ships, even the boilerplate. Architecture and API design run through me, not the model.
AI mirrors what the person using it can do. If there is substance to mirror, it amplifies you. If there isn’t, you get what people rightly call AI slop.
I bring this up because the three rules I landed on during the bank-conflict detour are what keep this kind of workflow honest:
- measure every step
- commit intermediate states
- define ship vs revert thresholds before you start
Those rules force the data to decide, not the model pretending to be confident.
Limitations
I'm not trying to hide what KAIO can't do, and it is not trying to replace Triton today:
- NVIDIA only (SM 7.0+)
- Inference only (no autograd)
- Rust subset DSL (no traits, generics, closures inside kernels)
- Pre-1.0 API
- Small matmul shapes are slow
If any of those are dealbreakers today, come back at 1.0.
What’s Next
INT8 quantized matmul is up next: that’s the unlock for running quantized LLMs in pure Rust. After that: bf16/TF32 tensor cores, deeper Candle integration, and PyO3 bindings so Python can call KAIO kernels too.
Examples
I kept asking myself what would make a skeptical dev actually star this repo. The answer was always the same: a folder full of examples I can copy from and get on with my day. So here is my attempt at that.
```shell
cargo add kaio
# or clone and run an example:
git clone https://github.com/dmriding/kaio && cd kaio/examples/fused_silu_gate
cargo run --release
```
Examples you can run immediately:
- fused_silu_gate — gated SiLU activation from LLaMA-family models
- gelu_comparison — exact vs. fast GELU, with a teaching moment on why both run at the same speed (bandwidth-bound, not compute-bound)
- rms_norm — single-block RMSNorm using shared memory and warp reductions
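The point of gelu_comparison translates directly to CPU Rust. Exact GELU needs erf, which isn't in Rust's standard library, so this sketch (mine, not code from the repo) uses the well-known Abramowitz & Stegun rational approximation for erf and compares against the standard tanh-based fast GELU:

```rust
use std::f32::consts::PI;

// erf approximation (Abramowitz & Stegun 7.1.26, max abs error ~1.5e-7).
fn erf(x: f32) -> f32 {
    let sign = if x < 0.0 { -1.0 } else { 1.0 };
    let x = x.abs();
    let t = 1.0 / (1.0 + 0.3275911 * x);
    let poly = t * (0.254829592
        + t * (-0.284496736
            + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
    sign * (1.0 - poly * (-x * x).exp())
}

// "Exact" GELU: 0.5 * x * (1 + erf(x / sqrt(2)))
fn gelu_exact(x: f32) -> f32 {
    0.5 * x * (1.0 + erf(x / 2.0f32.sqrt()))
}

// Fast tanh-based GELU approximation (the usual one from the GELU paper).
fn gelu_tanh(x: f32) -> f32 {
    0.5 * x * (1.0 + ((2.0 / PI).sqrt() * (x + 0.044715 * x * x * x)).tanh())
}

fn main() {
    for i in -40..=40 {
        let x = i as f32 * 0.1;
        // The tanh variant tracks exact GELU to within ~3e-4 everywhere,
        // which is one reason the two kernels are interchangeable in practice.
        assert!((gelu_exact(x) - gelu_tanh(x)).abs() < 1e-3);
    }
    println!("exact vs tanh GELU agree to <1e-3 on [-4, 4]");
}
```

On the GPU both variants read and write the same number of bytes, which is why they benchmark at the same speed: the memory traffic, not the transcendental math, is the bill.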
Docs: docs.rs/kaio
Repo: github.com/dmriding/kaio
I Want Your Opinion
If you made it this far, you either like my writing, think this has merit, or both. Either way, thank you. This project only becomes useful if people like you use it on real problems.
What would you build first? Where do your frameworks fall short? Is the DSL workable, or too restrictive?
Those answers shape where this goes next. Open a GitHub issue or leave a comment; I welcome any tickets or ideas.