Kaio

GPU kernel authoring in Rust.

Write kernels in Rust, compile to PTX at build time, and run with the NVIDIA driver. No CUDA toolkit required.

Headline numbers

115% cuBLAS

TC matmul async, 4096³

fp16→f32, RTX 4090, worst-of-10

93 TOPS

INT8 W8A8 matmul, 4096³ median

1.65× cuBLAS sgemm worst

2.5–4×

Fused INT4 QKV speedup

vs three separate kernels, decode shapes

RTX 4090 (sm_89), Windows 11, Rust 1.94.1, NVIDIA driver + CUDA 12.8 runtime · 5 warmups + 20 timed iterations × 10 runs · methodology

No CUDA toolkit

Just the NVIDIA driver. Build in CI on a stock GitHub runner.

Windows + Linux

No WSL2. No Linux-only runtime. cargo build works everywhere.

Pure Rust

Kernels in #[gpu_kernel] Rust. No FFI. No Python. No nvcc.

Workspace

Seven crates, each with one job. Drop down a layer when you need more control.

kaio

Umbrella crate. Re-exports the prelude and ties the workspace together.

kaio-macros

The #[gpu_kernel] proc macro. Parses Rust → emits PTX at compile time.

kaio-core

PTX IR + instruction emitters. Zero external dependencies — the foundation everything else builds on.

kaio-runtime

CUDA driver wrapper via cudarc. Launch kernels, manage memory, sync streams.

kaio-ops

Pre-built kernels: matmul (scalar / tensor-core / INT8 / INT4), attention, fused QKV projection, norms, activations.

kaio-candle

candle bridge — 8 forward ops, 2 backward ops, event-based stream sync. Drop Kaio into your existing candle pipeline.

kaio-py Preview

PyO3 bindings — proof-of-concept demo. Request specific bindings via the issue template; the surface grows as demand does.

Write kernels in Rust

The fused SiLU-gate from every LLaMA / Mistral / Qwen feedforward block. Seven lines.

use kaio::prelude::*;

#[gpu_kernel(block_size = 256)]
fn fused_silu_gate(x: &[f32], gate: &[f32], out: &mut [f32], n: u32) {
    let idx = thread_idx_x() + block_idx_x() * block_dim_x();
    if idx < n {
        let xi = x[idx];
        let sig = 1.0f32 / (1.0f32 + exp(-xi));
        out[idx] = xi * sig * gate[idx];
    }
}

The proc macro handles PTX generation, register allocation, shared-memory declarations, and launch configuration. Every emitted kernel is validated against ptxas --verify.

Measured performance

Measured against cuBLAS sgemm at 4096³ on RTX 4090. Worst-of-10 framing — the slowest of ten runs.

Tensor-Core Matmul

matmul_tc / matmul_tc_async, fp16 inputs → f32 accumulation

VariantWorstMedian
Sync107.0%103.2%
Async115.1%111.5%

vs cuBLAS sgemm (f32×f32). Different precision profiles — both are matmul, both ship to inference paths today.

Quantized Matmul

matmul_int8 (W8A8) and matmul_int4 (W4A16, GPTQ-style)

KernelTOPSvs sgemm
INT8 W8A884–93165%
INT4 W4A1652–58126%

8× memory-bandwidth advantage from low-precision weights drives the ratio. INT8 should ideally compare against cublasGemmEx CUDA_R_8I.

Full performance breakdown →

Small shapes (256³–1024³) lag cuBLAS — the 64×64 block tile can't fill 128 SMs at small sizes. Documented honestly in the perf doc.

Built-in operations

Production-tuned kernels ship with kaio-ops.

Tensor-Core Matmul

matmul_tc / matmul_tc_async — fp16→f32 via mma.sync. SM 8.0+. Exceeds cuBLAS at 4096³ in every run.

INT8 Matmul (W8A8)

matmul_int8 — symmetric i8×i8→f32. SM 8.0+, K%32==0. 93 TOPS median at 4096³.

INT4 Matmul (W4A16)

matmul_int4 — GPTQ-style packed s4×f16→f32. group_size=128, K%128==0.

FlashAttention

attention_flash — O(d_k) memory, d_k ≤ 256, no seq_k cap. Causal variant 1.5–1.9× faster at long seq.

Tensor-Core Attention

attention_tc — f16 Q/K/V → f32, seq_k ≤ 384. Separate kernel from FlashAttention, optimized for short sequences.

Fused QKV Projection

qkv_project_int8 / qkv_project_int4 — tri-output fusion, 2.5–4× decode speedup vs three separate kernels.

candle Bridge

kaio-candle — 8 forward ops, 2 backward, event-based stream sync. Drop Kaio kernels into existing candle pipelines.

Norm + Activation

RMSNorm, LayerNorm, softmax, fused SiLU-gate, GELU (exact + fast). Built with shared memory and warp-shuffle reductions.

Custom Kernels

Drop #[gpu_kernel] on any function in the Rust subset DSL. Block reductions, warp shuffles, shared memory, math builtins all available.

Where Kaio fits

Kaio isn't a replacement for candle or burn. It's the layer you reach for when they don't have the op you need.

Kaio cudarc candle / burn Triton Raw CUDA
Write kernels in Rust load PTX Python
No CUDA toolkit needed varies
Windows native partial no
Compile-time PTX codegen runtime JIT
Type-safe kernel signatures
ML framework integration candle standalone native PyTorch manual

Triton's lack of Windows support and runtime-JIT model are the two friction points Kaio was built to remove for Rust developers.

Design constraints

No CUDA Toolkit Required

PTX is generated by Kaio itself. End users only need an NVIDIA driver — not the full CUDA toolchain.

  • Distribute Rust binaries with embedded GPU code
  • No nvcc, no separate build steps
  • Windows and Linux native — no WSL2 hack
  • NVIDIA driver 525+ is all you ship with

Type Safety All The Way Down

Kernel signatures are checked at compile time. Dtype mismatches don't make it to runtime.

  • 88.39% line coverage workspace-wide
  • Host tests run without GPU hardware
  • Every kernel validated by ptxas --verify
  • Catches bugs CUDA C++ ships to prod

Requirements & scope

Setting expectations honestly — pre-1.0 with active development.

What You Need

  • Rust 1.94+ (pinned via rust-toolchain.toml)
  • NVIDIA GPU SM 7.0+ for general kernels
  • SM 8.0+ for tensor-core ops (TC matmul, INT8/INT4)
  • NVIDIA driver 525+ (CUDA 12.0+ compatible)
  • Windows 10/11 or Linux (Ubuntu 22.04+)

Current Limitations

  • · NVIDIA only — no AMD/Intel/Apple Silicon
  • · Mostly inference — matmul_tc has a backward, attention/quantized don't yet
  • · Single-device — multi-GPU on the roadmap
  • · DSL is a Rust subset — no closures, traits, generics, or string ops; can't call external Rust functions
  • · Pre-1.0 — API will move before stabilization

Run it locally

Clone the repository and run the showcase on NVIDIA hardware.

git clone https://github.com/dmriding/kaio.git
cd kaio
cargo xtask showcase

You'll see fused_silu_gate, gelu_comparison, rms_norm, layer_norm, softmax, int8_dequant, int8_matmul, int4_matmul, and quantized_attention compile, launch, verify against a CPU reference, and report median latency. The full transformer-primitive arc plus the W8A8 / W4A16 / fused-QKV headline ops.

Requires an NVIDIA GPU (SM 7.0+) with driver 525 or newer. No CUDA toolkit install needed.

cargo xtask bench

Tensor-core matmul vs cuBLAS sgemm across five sizes

cargo xtask all

Showcase + benchmarks back to back

cargo xtask --help

Full tooling surface

Or read the code

Each showcase is a standalone Cargo project — copy the directory out as a reference for your own kernel.

Activations & Norms

Quantization

Roadmap

Phase 7 ✓ Shipped

INT8/INT4 matmul, fused QKV projection, candle bridge.

Phase 8 Complete · Deferred

PyO3 preview crate kaio-py shipped with a binding-request workflow. Full Python surface deferred — request the ops you need and they'll land as demand surfaces.

Phase 9 In Progress

bf16 TC matmul, ldmatrix.sync.aligned, multi-block reductions, LDG.128 vectorized loads to close the small-shape gap.

Open work, maintained publicly.

Kaio is open source infrastructure. Review the code, run the benchmarks, or open an issue with a concrete use case.