Kaio

Rust-native GPU kernels. Compiled to PTX at build time.

The Triton equivalent for the Rust ecosystem. No CUDA toolkit required.

Pure Rust

Write kernels with #[gpu_kernel]

No CUDA C++. No FFI.

Build-Time PTX

Codegen happens at compile time

Type-safe signatures catch dtype errors early

92.5% of cuBLAS

Tensor-core matmul perf on large matrices

Production-grade

Four-Layer Architecture

Each layer is independently usable. Drop down a layer when you need more control.

Layer 4

High-Level Ops

FlashAttention, fused attention, matmul variants — pre-built and benchmarked.

Layer 3

Proc Macro

The #[gpu_kernel] attribute. Parses Rust → emits PTX at compile time.

Layer 2

Runtime

Driver integration via cudarc. Launch kernels, manage memory, sync streams.

Layer 1

PTX Codegen

Zero external deps. Pure Rust → PTX assembly. The foundation everything else builds on.

Write Kernels in Rust

Vector Add — End to End

use kaio::prelude::*;

#[gpu_kernel]
fn vector_add(a: &[f32], b: &[f32], out: &mut [f32]) {
    let i = thread_idx_x() + block_idx_x() * block_dim_x();
    if i < out.len() {
        out[i] = a[i] + b[i];
    }
}

fn main() -> Result<(), KaioError> {
    let ctx = Context::new(0)?;
    let a = ctx.upload(&vec![1.0f32; 1_000_000])?;
    let b = ctx.upload(&vec![2.0f32; 1_000_000])?;
    let mut out = ctx.zeros::<f32>(1_000_000)?;

    // ceil-div: enough blocks so every one of the 1,000,000 elements gets a thread
    let grid = ((1_000_000 + 255) / 256, 1, 1);
    vector_add::launch(&ctx, grid, (256, 1, 1), &a, &b, &mut out)?;

    let result = out.download()?;
    assert_eq!(result[0], 3.0);
    Ok(())
}
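The grid size for a 1-D launch is a ceiling division: enough blocks that `blocks * threads_per_block >= n`, with the kernel's bounds check discarding the overshoot. A minimal sketch (the `grid_size` helper is illustrative, not part of Kaio's API):

```rust
// Ceiling division for 1-D launch configuration: guarantees
// grid * threads_per_block >= n, so every element is covered.
// The kernel's `if i < out.len()` check handles the overshoot.
fn grid_size(n: usize, threads_per_block: usize) -> usize {
    (n + threads_per_block - 1) / threads_per_block
}

fn main() {
    let n = 1_000_000;
    let block = 256;
    let grid = grid_size(n, block);
    assert_eq!(grid, 3907);          // 3907 * 256 = 1_000_192 >= 1_000_000
    assert!(grid * block >= n);
    assert!((grid - 1) * block < n); // and no whole block is wasted
    println!("{grid}");
}
```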

PTX is Generated at Compile Time

// build.rs picks up #[gpu_kernel] attributes
// and emits PTX assembly into the binary.
//
// No CUDA toolkit. No nvcc. No external compilers.
// Just `cargo build`.
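To make that concrete, the emitted PTX for a kernel like vector_add could look roughly like this. Hand-abbreviated and illustrative only: register allocation and address-space conversions (`cvta.to.global`) are elided, and the real codegen output will differ.

```ptx
.visible .entry vector_add(
    .param .u64 a, .param .u64 b, .param .u64 out, .param .u32 n
)
{
    .reg .u32  %r<6>;
    .reg .u64  %rd<8>;
    .reg .f32  %f<4>;
    .reg .pred %p1;

    // i = tid.x + ctaid.x * ntid.x
    mov.u32    %r1, %tid.x;
    mov.u32    %r2, %ctaid.x;
    mov.u32    %r3, %ntid.x;
    mad.lo.u32 %r4, %r2, %r3, %r1;

    // bounds check: if i >= n, exit
    ld.param.u32 %r5, [n];
    setp.ge.u32  %p1, %r4, %r5;
    @%p1 bra     DONE;

    // out[i] = a[i] + b[i]  (byte offset = i * 4 for f32)
    ld.param.u64 %rd1, [a];
    ld.param.u64 %rd2, [b];
    ld.param.u64 %rd3, [out];
    mul.wide.u32 %rd4, %r4, 4;
    add.u64      %rd5, %rd1, %rd4;
    add.u64      %rd6, %rd2, %rd4;
    add.u64      %rd7, %rd3, %rd4;
    ld.global.f32 %f1, [%rd5];
    ld.global.f32 %f2, [%rd6];
    add.f32       %f3, %f1, %f2;
    st.global.f32 [%rd7], %f3;
DONE:
    ret;
}
```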

Built-In Operations

Production-tuned kernels ship with the framework.

FlashAttention

Memory-efficient attention. Tiled, block-sparse, matches the reference implementation.
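The trick that makes tiled attention memory-efficient is the online softmax: a running max and running normalizer are updated per tile, so the full score row never has to materialize. A CPU sketch of that rescaling step (illustrative only; `online_softmax_norm` is not Kaio API):

```rust
// Online softmax over a stream of scores: keep a running max `m` and
// running normalizer `l`, rescaling `l` whenever a new max appears.
// This identity is what lets tiled attention process scores block by
// block instead of materializing the whole row.
fn online_softmax_norm(scores: &[f32]) -> (f32, f32) {
    let mut m = f32::NEG_INFINITY;
    let mut l = 0.0f32;
    for &s in scores {
        let m_new = m.max(s);
        l = l * (m - m_new).exp() + (s - m_new).exp();
        m = m_new;
    }
    (m, l)
}

fn main() {
    let scores = [1.0f32, 3.0, 2.0, 0.5];
    let (m, l) = online_softmax_norm(&scores);

    // Matches the two-pass version: max first, then sum of exp(s - max).
    let m_ref = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let l_ref: f32 = scores.iter().map(|s| (s - m_ref).exp()).sum();
    assert_eq!(m, m_ref);
    assert!((l - l_ref).abs() < 1e-6);
    println!("m={m}, l={l}");
}
```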

Fused Attention

QKV projection + attention + output projection in a single kernel launch.

Tensor-Core Matmul

SM 7.0+ tensor cores. 92.5% of cuBLAS on large matrices, no external lib.

Element-Wise Ops

Activation functions, normalization, fused chains — all auto-generated.

Reductions

Sum, max, mean, softmax. Warp-shuffle optimized.
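The warp-shuffle pattern halves the number of active lanes each step until lane 0 holds the result. A CPU model of the same reduction tree (illustrative, not Kaio's kernel code):

```rust
// CPU model of a warp-level shuffle-down sum: at each step, lane i
// adds the value held by lane i + offset, and the offset halves until
// the full sum sits in lane 0. On a GPU this is `shfl_down` across
// 32 lanes, with no shared-memory traffic.
fn warp_reduce_sum(mut lanes: [f32; 32]) -> f32 {
    let mut offset = 16;
    while offset > 0 {
        for i in 0..offset {
            lanes[i] += lanes[i + offset];
        }
        offset /= 2;
    }
    lanes[0]
}

fn main() {
    let lanes: [f32; 32] = core::array::from_fn(|i| i as f32);
    assert_eq!(warp_reduce_sum(lanes), 496.0); // 0 + 1 + ... + 31
    println!("{}", warp_reduce_sum(lanes));
}
```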

Custom Kernels

Drop the macro on any function. The framework handles the rest.

Why Kaio?

No CUDA Toolkit Required

PTX is generated by Kaio itself. End users only need an NVIDIA driver — not the full CUDA toolchain.

  • Distribute Rust binaries with embedded GPU code
  • No nvcc, no separate build steps
  • Windows and Linux native — no WSL2 hack

Type Safety All The Way Down

Kernel signatures are checked at compile time. Dtype mismatches don't make it to runtime.

  • 94.7% test coverage
  • Host tests run without GPU hardware
  • Catches the bugs CUDA C++ ships to prod
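The compile-time checking can be modeled with a trait bound: only element types implementing a device-argument trait are accepted, so a dtype mismatch is a type error rather than a launch failure. A sketch of the idea (the `DeviceArg` trait and `Buffer` type here are illustrative, not Kaio's actual definitions):

```rust
// Sketch: a marker trait restricts which element types a device
// buffer may carry. Passing an unsupported dtype fails to compile
// instead of corrupting memory at launch time.
trait DeviceArg {}
impl DeviceArg for f32 {}
impl DeviceArg for i32 {}
// Note: no impl for f64 — a Buffer<f64> is rejected by the compiler.

struct Buffer<T: DeviceArg> {
    data: Vec<T>,
}

fn launch_add<T: DeviceArg + Copy + std::ops::Add<Output = T>>(
    a: &Buffer<T>,
    b: &Buffer<T>,
) -> Vec<T> {
    a.data.iter().zip(&b.data).map(|(&x, &y)| x + y).collect()
}

fn main() {
    let a = Buffer { data: vec![1.0f32, 2.0] };
    let b = Buffer { data: vec![3.0f32, 4.0] };
    let out = launch_add(&a, &b);
    assert_eq!(out, vec![4.0, 6.0]);
    // let bad = Buffer { data: vec![1.0f64] }; // error: f64: DeviceArg not satisfied
    println!("{:?}", out);
}
```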

Requirements & Scope

Setting expectations honestly — pre-1.0 with active development.

What You Need

  • Rust 1.94+
  • NVIDIA GPU with SM 7.0+ (Volta and newer)
  • NVIDIA driver (no CUDA toolkit needed)
  • Windows or Linux

Current Limitations

  • NVIDIA only — no AMD/Intel GPU support
  • Inference-only — no backward pass or autograd
  • Single-GPU — multi-GPU on the roadmap
  • Pre-1.0 — API may shift before stabilization

Open Source. Built in the Open.

Star it, fork it, file issues. Have a use case we should know about? Tell us — Kaio is being shaped by what people actually want to build.