NVIDIA Releases cuda-oxide: A Native Rust-to-PTX Compiler for SIMT GPU Kernels

NVIDIA AI Just Released cuda-oxide: An Experimental Rust-to-CUDA Compiler Backend that Compiles SIMT GPU Kernels Directly to PTX

NVIDIA AI has launched cuda-oxide, an experimental compiler backend that allows developers to write CUDA SIMT kernels in standard Rust code. The project compiles Rust directly to PTX assembly without requiring C++ code or FFI bindings.

Why This Matters

GPU kernels are traditionally written in CUDA C++ or through higher-level Python abstractions such as Triton, which hide much of the hardware-level control. cuda-oxide brings the CUDA SIMT model to Rust directly, so developers can write hardware-specific optimizations while keeping Rust’s memory safety guarantees. The reported GEMM implementation reaches 868 TFLOPS on B200 hardware, or 58% of cuBLAS speed-of-light performance.

Key Insights

The backend uses rustc_public (Stable MIR), a versioned, stable API over compiler internals, which is meant to keep the build from breaking when nightly Rust changes.
The middle stages use Pliron, a Rust-native MLIR-like IR framework, enabling the entire compiler to build with cargo without a C++ toolchain or CMake.
Hardware-specific barrier semantics are protected by disabling JumpThreading optimizations and marking synchronization primitives as convergent in LLVM IR.
Native support for modern NVIDIA architectures includes intrinsics for Tensor Memory Accelerator (TMA), Thread Block Clusters, and scoped atomics.
Device code from library dependencies is compiled lazily by reading Stable MIR from .rlib metadata on demand during the kernel compilation phase.
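One way to picture the lazy compilation described in the last point is as a reachability walk that starts at the kernel and only visits functions the kernel actually calls. The sketch below is a toy model in plain Rust, not cuda-oxide's implementation: the call graph, the function names, and the worklist strategy are all assumptions made for illustration, and no Stable MIR or .rlib metadata is read.

```rust
use std::collections::{BTreeSet, HashMap};

/// Toy call graph (an assumption for illustration):
/// function name -> names of the functions it calls.
type CallGraph = HashMap<&'static str, Vec<&'static str>>;

/// Returns the functions that would be lowered when compilation starts at
/// `kernel`. Functions that are never reached are never touched, which is
/// the "on demand" behaviour this sketch tries to model.
fn functions_to_lower(graph: &CallGraph, kernel: &'static str) -> BTreeSet<&'static str> {
    let mut lowered = BTreeSet::new();
    let mut worklist = vec![kernel];
    while let Some(name) = worklist.pop() {
        // `insert` returns false if the function was already visited.
        if !lowered.insert(name) {
            continue;
        }
        // Queue the callees; unknown names are treated as leaf functions.
        if let Some(callees) = graph.get(name) {
            worklist.extend(callees.iter().copied());
        }
    }
    lowered
}
```

With a graph in which a hypothetical vecadd kernel calls lib::fma, and lib::unused is never called from the kernel, the walk returns only vecadd and lib::fma, so lib::unused would never be lowered.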

Working Examples

To observe the full compilation trace, from Rust MIR through the Pliron dialects to PTX:

cargo oxide pipeline vecadd

To build the host binary and the PTX file together and then run the application:

cargo oxide run vecadd

Practical Applications

High-performance linear algebra: Implementing GEMM pipelines that use architecture-specific tensor core instructions such as WGMMA (Hopper) and tcgen05 (Blackwell, including B200) to get closer to vendor-library performance. Pitfall: index_2d(stride) is currently documented as unsound in the 0.x release and should be avoided.
Safe GPU kernel authoring: Using the DisjointSlice and ThreadIndex abstractions to create kernels that are race-free by construction. Pitfall: Manually duplicating bar.sync instructions across branches, which violates SIMT convergence and breaks hardware barrier semantics.
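The idea behind race-free-by-construction output access can be sketched on the CPU in plain Rust. The code below is not the cuda-oxide API: ToyThreadIndex, ToyDisjointSlice, and launch are hypothetical stand-ins, and the "threads" run one after another in a loop. It assumes the guarantee rests on two properties: an index can only be created by the launcher, and the output can only be written through such an index.

```rust
mod toy {
    /// Hypothetical per-thread index. The field is private, so code outside
    /// this module cannot forge an index; only `launch` hands them out.
    #[derive(Clone, Copy)]
    pub struct ToyThreadIndex(usize);

    impl ToyThreadIndex {
        pub fn get(self) -> usize {
            self.0
        }
    }

    /// Hypothetical output view: the only way to write is through an index.
    pub struct ToyDisjointSlice<'a, T> {
        data: &'a mut [T],
    }

    impl<'a, T> ToyDisjointSlice<'a, T> {
        /// Returns the element owned by `idx`, or `None` if out of bounds.
        pub fn get_mut(&mut self, idx: ToyThreadIndex) -> Option<&mut T> {
            self.data.get_mut(idx.0)
        }
    }

    /// Runs `kernel` once per simulated thread, issuing each index once.
    pub fn launch<T>(
        out: &mut [T],
        threads: usize,
        mut kernel: impl FnMut(ToyThreadIndex, &mut ToyDisjointSlice<'_, T>),
    ) {
        let mut view = ToyDisjointSlice { data: out };
        for t in 0..threads {
            kernel(ToyThreadIndex(t), &mut view);
        }
    }
}

use toy::{launch, ToyDisjointSlice, ToyThreadIndex};

/// Body executed by each simulated thread: c[i] = a[i] + b[i].
fn vecadd_thread(idx: ToyThreadIndex, a: &[f32], b: &[f32], c: &mut ToyDisjointSlice<'_, f32>) {
    // Threads past the end of the output get `None` and do nothing.
    if let Some(out) = c.get_mut(idx) {
        *out = a[idx.get()] + b[idx.get()];
    }
}

/// Simulates a launch of `threads` threads over inputs of equal length.
fn vecadd(a: &[f32], b: &[f32], threads: usize) -> Vec<f32> {
    assert_eq!(a.len(), b.len());
    let mut c = vec![0.0f32; a.len()];
    launch(&mut c, threads, |idx, view| vecadd_thread(idx, a, b, view));
    c
}
```

Calling vecadd with more threads than elements is safe in this model because out-of-range threads receive None from get_mut instead of an out-of-bounds write. A sequential loop cannot demonstrate the absence of data races, so this only models the shape of the interface, not the guarantee itself.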

References:

https://www.marktechpost.com/2026/05/09/nvidia-ai-just-released-cuda-oxide-an-experimental-rust-to-cuda-compiler-backend-that-compiles-simt-gpu-kernels-directly-to-ptx/
