Kaio
GPU kernel authoring in Rust.
Write kernels in Rust, compile to PTX at build time, and run with the NVIDIA driver. No CUDA toolkit required.
Headline numbers
TC matmul async, 4096³
fp16→f32, RTX 4090, worst-of-10
INT8 W8A8 matmul, 4096³ median
1.65× cuBLAS sgemm worst
Fused INT4 QKV speedup
vs three separate kernels, decode shapes
RTX 4090 (sm_89), Windows 11, Rust 1.94.1, NVIDIA driver + CUDA 12.8 runtime · 5 warmups + 20 timed iterations × 10 runs · methodology
No CUDA toolkit
Just the NVIDIA driver. Build in CI on a stock GitHub runner.
Windows + Linux
No WSL2. No Linux-only runtime. cargo build works everywhere.
Pure Rust
Kernels in #[gpu_kernel] Rust. No FFI. No Python. No nvcc.
Workspace
Seven crates, each with one job. Drop down a layer when you need more control.
Umbrella crate. Re-exports the prelude and ties the workspace together.
The #[gpu_kernel] proc macro. Parses Rust → emits PTX at compile time.
PTX IR + instruction emitters. Zero external dependencies — the foundation everything else builds on.
CUDA driver wrapper via cudarc. Launch kernels, manage memory, sync streams.
Pre-built kernels: matmul (scalar / tensor-core / INT8 / INT4), attention, fused QKV projection, norms, activations.
candle bridge — 8 forward ops, 2 backward ops, event-based stream sync. Drop Kaio into your existing candle pipeline.
PyO3 bindings — proof-of-concept demo. Request specific bindings via the issue template; the surface grows as demand does.
Write kernels in Rust
The fused SiLU-gate from every LLaMA / Mistral / Qwen feedforward block. Seven lines.
use kaio::prelude::*;
#[gpu_kernel(block_size = 256)]
fn fused_silu_gate(x: &[f32], gate: &[f32], out: &mut [f32], n: u32) {
let idx = thread_idx_x() + block_idx_x() * block_dim_x();
if idx < n {
let xi = x[idx];
let sig = 1.0f32 / (1.0f32 + exp(-xi));
out[idx] = xi * sig * gate[idx];
}
}
The proc macro handles PTX generation, register allocation, shared-memory declarations, and launch configuration. Every emitted kernel is validated against ptxas --verify.
Measured performance
Measured against cuBLAS sgemm at 4096³ on RTX 4090. Worst-of-10 framing — the slowest of ten runs.
Tensor-Core Matmul
matmul_tc / matmul_tc_async, fp16 inputs → f32 accumulation
| Variant | Worst | Median |
|---|---|---|
| Sync | 107.0% | 103.2% |
| Async | 115.1% | 111.5% |
vs cuBLAS sgemm (f32×f32). Different precision profiles — both are matmul, both ship to inference paths today.
Quantized Matmul
matmul_int8 (W8A8) and matmul_int4 (W4A16, GPTQ-style)
| Kernel | TOPS | vs sgemm |
|---|---|---|
| INT8 W8A8 | 84–93 | 165% |
| INT4 W4A16 | 52–58 | 126% |
8× memory-bandwidth advantage from low-precision weights drives the ratio. INT8 should ideally compare against cublasGemmEx CUDA_R_8I.
Small shapes (256³–1024³) lag cuBLAS — the 64×64 block tile can't fill 128 SMs at small sizes. Documented honestly in the perf doc.
Built-in operations
Production-tuned kernels ship with kaio-ops.
Tensor-Core Matmul
matmul_tc / matmul_tc_async — fp16→f32 via mma.sync. SM 8.0+. Exceeds cuBLAS at 4096³ in every run.
INT8 Matmul (W8A8)
matmul_int8 — symmetric i8×i8→f32. SM 8.0+, K%32==0. 93 TOPS median at 4096³.
INT4 Matmul (W4A16)
matmul_int4 — GPTQ-style packed s4×f16→f32. group_size=128, K%128==0.
FlashAttention
attention_flash — O(d_k) memory, d_k ≤ 256, no seq_k cap. Causal variant 1.5–1.9× faster at long seq.
Tensor-Core Attention
attention_tc — f16 Q/K/V → f32, seq_k ≤ 384. Separate kernel from FlashAttention, optimized for short sequences.
Fused QKV Projection
qkv_project_int8 / qkv_project_int4 — tri-output fusion, 2.5–4× decode speedup vs three separate kernels.
candle Bridge
kaio-candle — 8 forward ops, 2 backward, event-based stream sync. Drop Kaio kernels into existing candle pipelines.
Norm + Activation
RMSNorm, LayerNorm, softmax, fused SiLU-gate, GELU (exact + fast). Built with shared memory and warp-shuffle reductions.
Custom Kernels
Drop #[gpu_kernel] on any function in the Rust subset DSL. Block reductions, warp shuffles, shared memory, math builtins all available.
Where Kaio fits
Kaio isn't a replacement for candle or burn. It's the layer you reach for when they don't have the op you need.
| Kaio | cudarc | candle / burn | Triton | Raw CUDA | |
|---|---|---|---|---|---|
| Write kernels in Rust | ✓ | load PTX | — | Python | — |
| No CUDA toolkit needed | ✓ | ✓ | varies | — | — |
| Windows native | ✓ | ✓ | partial | no | ✓ |
| Compile-time PTX codegen | ✓ | — | — | runtime JIT | ✓ |
| Type-safe kernel signatures | ✓ | — | — | — | — |
| ML framework integration | candle | standalone | native | PyTorch | manual |
Triton's lack of Windows support and runtime-JIT model are the two friction points Kaio was built to remove for Rust developers.
Design constraints
No CUDA Toolkit Required
PTX is generated by Kaio itself. End users only need an NVIDIA driver — not the full CUDA toolchain.
- ✓ Distribute Rust binaries with embedded GPU code
- ✓ No nvcc, no separate build steps
- ✓ Windows and Linux native — no WSL2 hack
- ✓ NVIDIA driver 525+ is all you ship with
Type Safety All The Way Down
Kernel signatures are checked at compile time. Dtype mismatches don't make it to runtime.
- ✓ 88.39% line coverage workspace-wide
- ✓ Host tests run without GPU hardware
- ✓ Every kernel validated by
ptxas --verify - ✓ Catches bugs CUDA C++ ships to prod
Requirements & scope
Setting expectations honestly — pre-1.0 with active development.
What You Need
- ✓ Rust 1.94+ (pinned via
rust-toolchain.toml) - ✓ NVIDIA GPU SM 7.0+ for general kernels
- ✓ SM 8.0+ for tensor-core ops (TC matmul, INT8/INT4)
- ✓ NVIDIA driver 525+ (CUDA 12.0+ compatible)
- ✓ Windows 10/11 or Linux (Ubuntu 22.04+)
Current Limitations
- · NVIDIA only — no AMD/Intel/Apple Silicon
- · Mostly inference —
matmul_tchas a backward, attention/quantized don't yet - · Single-device — multi-GPU on the roadmap
- · DSL is a Rust subset — no closures, traits, generics, or string ops; can't call external Rust functions
- · Pre-1.0 — API will move before stabilization
Run it locally
Clone the repository and run the showcase on NVIDIA hardware.
git clone https://github.com/dmriding/kaio.git
cd kaio
cargo xtask showcase
You'll see fused_silu_gate, gelu_comparison, rms_norm, layer_norm, softmax, int8_dequant, int8_matmul, int4_matmul, and quantized_attention compile, launch, verify against a CPU reference, and report median latency. The full transformer-primitive arc plus the W8A8 / W4A16 / fused-QKV headline ops.
Requires an NVIDIA GPU (SM 7.0+) with driver 525 or newer. No CUDA toolkit install needed.
cargo xtask bench
Tensor-core matmul vs cuBLAS sgemm across five sizes
cargo xtask all
Showcase + benchmarks back to back
cargo xtask --help
Full tooling surface
Or read the code
Each showcase is a standalone Cargo project — copy the directory out as a reference for your own kernel.
Activations & Norms
fused_silu_gate— LLaMA-family feedforward activationgelu_comparison— exact vs fast GELU, with the bandwidth-bound lessonrms_norm— single-block RMSNorm via shared memorylayer_norm— single-block LayerNormsoftmax— single-block softmax
Quantization
int8_dequant— INT8 dequantization primitiveint8_matmul— W8A8 matrix multiply, full pathint4_matmul— W4A16 GPTQ-style quantized matmulquantized_attention— full quantized attention path
Roadmap
INT8/INT4 matmul, fused QKV projection, candle bridge.
PyO3 preview crate kaio-py shipped with a binding-request workflow. Full Python surface deferred — request the ops you need and they'll land as demand surfaces.
bf16 TC matmul, ldmatrix.sync.aligned, multi-block reductions, LDG.128 vectorized loads to close the small-shape gap.
Latest from the dev log
Release notes, deep-dives, and what we're working on next.
Open work, maintained publicly.
Kaio is open source infrastructure. Review the code, run the benchmarks, or open an issue with a concrete use case.