NVIDIA Releases cuda-oxide: A Native Rust-to-PTX Compiler for SIMT GPU Kernels
These articles are AI-generated summaries. Please check the original sources for full details.
NVIDIA AI Just Released cuda-oxide: An Experimental Rust-to-CUDA Compiler Backend that Compiles SIMT GPU Kernels Directly to PTX
NVIDIA AI has launched cuda-oxide, an experimental compiler backend that lets developers write CUDA SIMT kernels in standard Rust. The project compiles Rust directly to PTX assembly, with no C++ code or FFI bindings required.
Why This Matters
GPU programming has traditionally meant writing C++ or using high-level Python abstractions such as Triton, which often hide hardware-level control. cuda-oxide closes this gap by bringing the CUDA SIMT model natively to Rust, so developers can write hardware-specific optimizations while keeping Rust’s memory safety guarantees. The reported performance is competitive: GEMM implementations reach 868 TFLOPS on B200 hardware, which is 58% of cuBLAS speed-of-light performance.
Key Insights
- The backend utilizes rustc_public (Stable MIR) to provide a versioned, stable API over compiler internals, preventing build failures on nightly Rust updates.
- The middle stages use Pliron, a Rust-native MLIR-like IR framework, enabling the entire compiler to build with cargo without a C++ toolchain or CMake.
- Hardware-specific barrier semantics are protected by disabling JumpThreading optimizations and marking synchronization primitives as convergent in LLVM IR.
- Native support for modern NVIDIA architectures includes intrinsics for Tensor Memory Accelerator (TMA), Thread Block Clusters, and scoped atomics.
- Device code from library dependencies is compiled lazily by reading Stable MIR from .rlib metadata on demand during the kernel compilation phase.
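The barrier point above can be pictured with a CPU-side analogy. The sketch below is plain Rust using `std::sync::Barrier`, not cuda-oxide or GPU code, and the function name `neighbour_exchange` is made up for illustration. It only shows the basic contract of a barrier: no participant moves to the read phase until every participant has finished the write phase. The source describes a stricter GPU requirement on top of this, namely that the barrier instruction must stay a single convergent operation rather than being duplicated across branches.

```rust
use std::sync::{Arc, Barrier, Mutex};
use std::thread;

/// CPU analogy only (hypothetical helper, not part of cuda-oxide):
/// each worker publishes a value, waits at one shared barrier,
/// then reads the value published by its neighbour.
pub fn neighbour_exchange(num_threads: usize) -> Vec<usize> {
    let barrier = Arc::new(Barrier::new(num_threads));
    let shared = Arc::new(Mutex::new(vec![0usize; num_threads]));

    let handles: Vec<_> = (0..num_threads)
        .map(|tid| {
            let barrier = Arc::clone(&barrier);
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                // Phase 1: publish this worker's value.
                shared.lock().unwrap()[tid] = tid * 10;

                // All workers must arrive here before any continues.
                // If some workers never arrived, the rest would block forever.
                barrier.wait();

                // Phase 2: the neighbour's slot is guaranteed to be written.
                let neighbour = (tid + 1) % num_threads;
                let value = shared.lock().unwrap()[neighbour];
                value
            })
        })
        .collect();

    // Collect results in worker order.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Without the `barrier.wait()` call, a worker could read its neighbour's slot before it was written, which is the kind of ordering bug a misplaced or duplicated GPU barrier can reintroduce.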
Working Examples
To observe the full compilation trace, from Rust MIR through the Pliron dialects to PTX:
cargo oxide pipeline vecadd
To build the host binary and the PTX file together and then run the application:
cargo oxide run vecadd
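The source does not show the `vecadd` kernel itself. Assuming the name refers to element-wise vector addition, the following plain-Rust sketch emulates on the CPU what a SIMT launch of such a kernel computes. It is not cuda-oxide code: the function name `vecadd_simt_emulated` and the `block_dim` parameter are illustrative, and the two nested loops stand in for the grid of blocks and the threads inside each block.

```rust
/// CPU-side emulation of a SIMT vector-add launch.
/// Illustrative only; this is not the cuda-oxide API.
pub fn vecadd_simt_emulated(a: &[f32], b: &[f32], block_dim: usize) -> Vec<f32> {
    assert_eq!(a.len(), b.len(), "inputs must have equal length");
    assert!(block_dim > 0, "block_dim must be non-zero");

    let n = a.len();
    let mut c = vec![0.0_f32; n];

    // Round up so the grid covers every element.
    let grid_dim = (n + block_dim - 1) / block_dim;

    for block_idx in 0..grid_dim {
        for thread_idx in 0..block_dim {
            // Global index, as in the CUDA model:
            // blockIdx.x * blockDim.x + threadIdx.x.
            let i = block_idx * block_dim + thread_idx;

            // Guard: the last block may hold threads past the end of the data.
            if i < n {
                c[i] = a[i] + b[i];
            }
        }
    }
    c
}
```

With five elements and `block_dim = 4`, the emulated grid has two blocks and three of the eight simulated threads fail the bounds check and do nothing.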
Practical Applications
- High-performance linear algebra: Implementing GEMM pipelines that utilize B200 hardware features like WGMMA and tcgen05 for near-native performance. Pitfall: Attempting to use index_2d(stride), which is currently documented as unsound in the 0.x release.
- Safe GPU Kernel Authoring: Using the DisjointSlice and ThreadIndex abstractions to create race-free kernels by construction. Pitfall: Manually duplicating bar.sync instructions across branches, which violates SIMT convergence and breaks hardware barrier semantics.
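The source names the `DisjointSlice` and `ThreadIndex` abstractions but does not show their API. The sketch below is a hypothetical CPU-side model of the idea, written in plain Rust. The types `ThreadIndexModel` and `DisjointSliceModel` and the `launch` helper are invented for illustration and should not be read as the real cuda-oxide signatures. The point is the shape of the guarantee: kernel code cannot construct an index on its own, and mutable access is keyed by that index, so each simulated thread can only touch its own element.

```rust
mod model {
    /// Hypothetical stand-in for a per-thread index.
    /// The field is private, so kernel code cannot forge an index.
    #[derive(Clone, Copy)]
    pub struct ThreadIndexModel(usize);

    impl ThreadIndexModel {
        pub fn get(self) -> usize {
            self.0
        }
    }

    /// Hypothetical stand-in for an output slice with per-thread access.
    pub struct DisjointSliceModel<'a, T> {
        data: &'a mut [T],
    }

    impl<'a, T> DisjointSliceModel<'a, T> {
        /// Access is keyed by the thread index; out-of-range yields None.
        pub fn get_mut(&mut self, idx: ThreadIndexModel) -> Option<&mut T> {
            self.data.get_mut(idx.0)
        }
    }

    /// Sequential stand-in for a kernel launch: each simulated thread
    /// receives a unique index created here and nowhere else.
    pub fn launch<T>(
        out: &mut [T],
        threads: usize,
        mut body: impl FnMut(ThreadIndexModel, &mut DisjointSliceModel<'_, T>),
    ) {
        let mut view = DisjointSliceModel { data: out };
        for t in 0..threads {
            body(ThreadIndexModel(t), &mut view);
        }
    }
}

use model::launch;

/// Vector add written against the hypothetical model above.
pub fn vecadd_model(a: &[f32], b: &[f32]) -> Vec<f32> {
    assert_eq!(a.len(), b.len(), "inputs must have equal length");
    let mut c = vec![0.0_f32; a.len()];
    // Launch more threads than elements to exercise the bounds check.
    launch(&mut c, a.len() + 3, |idx, out| {
        if let Some(slot) = out.get_mut(idx) {
            let i = idx.get();
            *slot = a[i] + b[i];
        }
    });
    c
}
```

This model runs its threads one after another, so it demonstrates the access pattern rather than real parallel safety. A real device-side type would have to provide the same one-thread-one-element guarantee across concurrently running threads.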