Kaio
Rust-native GPU kernels. Compiled to PTX at build time.
The Triton equivalent for the Rust ecosystem. No CUDA toolkit required.
Write kernels with #[gpu_kernel]
No CUDA C++. No FFI.
Codegen happens at compile time
Type-safe signatures catch dtype errors early
Tensor-core matmul performance on large matrices
Production-grade
Four-Layer Architecture
Each layer is independently usable. Drop down a layer when you need more control.
High-Level Ops
FlashAttention, fused attention, matmul variants — pre-built and benchmarked.
Proc Macro
The #[gpu_kernel] attribute. Parses Rust → emits PTX at compile time.
Runtime
Driver integration via cudarc. Launch kernels, manage memory, sync streams.
PTX Codegen
Zero external deps. Pure Rust → PTX assembly. The foundation everything else builds on.
Write Kernels in Rust
Vector Add — End to End
```rust
use kaio::prelude::*;

#[gpu_kernel]
fn vector_add(a: &[f32], b: &[f32], out: &mut [f32]) {
    let i = thread_idx_x() + block_idx_x() * block_dim_x();
    if i < out.len() {
        out[i] = a[i] + b[i];
    }
}

fn main() -> Result<(), KaioError> {
    let ctx = Context::new(0)?;
    let a = ctx.upload(&vec![1.0f32; 1_000_000])?;
    let b = ctx.upload(&vec![2.0f32; 1_000_000])?;
    let mut out = ctx.zeros::<f32>(1_000_000)?;

    // 3907 blocks × 256 threads ≥ 1,000,000 elements.
    vector_add::launch(&ctx, (3907, 1, 1), (256, 1, 1), &a, &b, &mut out)?;

    let result = out.download()?;
    assert_eq!(result[0], 3.0);
    Ok(())
}
```
PTX is Generated at Compile Time
```rust
// build.rs picks up #[gpu_kernel] attributes
// and emits PTX assembly into the binary.
//
// No CUDA toolkit. No nvcc. No external compilers.
// Just `cargo build`.
```
Built-In Operations
Production-tuned kernels ship with the framework.
FlashAttention
Memory-efficient attention. Tiled and block-sparse; matches the reference implementation.
Fused Attention
QKV projection + attention + output projection in a single kernel launch.
Tensor-Core Matmul
SM 7.0+ tensor cores. 92.5% of cuBLAS on large matrices, no external lib.
Element-Wise Ops
Activation functions, normalization, fused chains — all auto-generated.
Reductions
Sum, max, mean, softmax. Warp-shuffle optimized.
Custom Kernels
Drop the macro on any function. The framework handles the rest.
Why Kaio?
No CUDA Toolkit Required
PTX is generated by Kaio itself. End users only need an NVIDIA driver — not the full CUDA toolchain.
- ✓ Distribute Rust binaries with embedded GPU code
- ✓ No nvcc, no separate build steps
- ✓ Windows and Linux native — no WSL2 hack
Type Safety All The Way Down
Kernel signatures are checked at compile time. Dtype mismatches don't make it to runtime.
- ✓ 94.7% test coverage
- ✓ Host tests run without GPU hardware
- ✓ Catches the bugs CUDA C++ ships to prod
Requirements & Scope
Setting expectations honestly — pre-1.0 with active development.
What You Need
- ✓ Rust 1.94+
- ✓ NVIDIA GPU with SM 7.0+ (Volta and newer)
- ✓ NVIDIA driver (no CUDA toolkit needed)
- ✓ Windows or Linux
Current Limitations
- · NVIDIA only — no AMD/Intel GPU support
- · Inference-only — no backward pass or autograd
- · Single-GPU — multi-GPU on the roadmap
- · Pre-1.0 — API may shift before stabilization
Latest from the Dev Log
Release notes, deep-dives, and what we're working on next.
Open Source. Built in the Open.
Star it, fork it, file issues. Have a use case we should know about? Tell us — Kaio is being shaped by what people actually want to build.