BarraCUDA 0.5

The first tagged release. The headline is that you can write a Triton kernel, matmul and all, and run it on a CPU with
no GPU. The --cpu backend lowers BIR straight to x86-64 with the SIMT model collapsed into a thread loop, and the
rank-2 tile path materialises and unrolls tiles, so tl.dot inside a K-loop can sweep a contraction of arbitrary size.
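
To make "the SIMT model collapsed into a thread loop" concrete, here is a minimal hand-written sketch in C. It is not BarraCUDA's generated code: the names dim1, saxpy_body and saxpy_launch are hypothetical, and the sketch assumes a kernel with no barriers or shared memory (a kernel that synchronises between threads generally needs the loop split at each barrier).

```c
#include <assert.h>

/* Hypothetical stand-in for CUDA's built-in index variables (1-D only). */
typedef struct { unsigned x; } dim1;

/* The kernel body, written once per logical thread as in SIMT code. */
static void saxpy_body(dim1 blockIdx, dim1 blockDim, dim1 threadIdx,
                       int n, float a, const float *x, float *y)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < (unsigned)n)            /* same bounds guard a GPU kernel would use */
        y[i] = a * x[i] + y[i];
}

/* The "launch": grid and block dimensions become two nested host loops. */
static void saxpy_launch(unsigned grid, unsigned block,
                         int n, float a, const float *x, float *y)
{
    assert(block > 0);
    dim1 bd = { block };
    for (unsigned b = 0; b < grid; b++) {
        for (unsigned t = 0; t < block; t++) {
            dim1 bi = { b };
            dim1 ti = { t };
            saxpy_body(bi, bd, ti, n, a, x, y);
        }
    }
}
```

Calling saxpy_launch(3, 4, 10, 2.0f, x, y) runs 12 logical threads over 10 elements; the last two fall outside the bounds guard and do nothing, exactly as surplus GPU threads would.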

New in this cycle

CPU backend (--cpu). CUDA and Triton kernels compile to a host object and run natively. Headline demo:
examples/cpu_launch_matmul.c.
RISC-V backend (--rv64). Same idea, but emitting RV64IMFD objects that run under qemu.
Cross-backend differential testing (tests/diff/). The same BIR goes through two backends and the output buffers
are diffed, with the CPU backend as the oracle. Every case runs --inject so a green result actually means something.
Triton scalar math intrinsics. exp, log, sin, cos, tan, tanh, sqrt, rsqrt, abs, floor,
ceil, maximum, minimum, fdiv. Thanks to @shivam2931120 for the PR, radians-to-turns convention done right.
Triton constexpr ABI compaction. tl.constexpr params with defaults fold to literals and drop out of the
runtime signature.
CUDA fixes. --cpu / --rv64 on their own now run sema and don't trip the parse-dump fallback. Typedef-struct
kernels compile through --cpu, and --parse no longer segfaults on synthetic anon names.
--version flag.
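
The differential-testing idea from the list above can be sketched in a few lines of C. This is an illustration only, not the code in tests/diff/. In particular, how --inject actually works is an assumption here: the sketch guesses that it corrupts a result on purpose so the harness can prove it would have noticed. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the first element where candidate differs from oracle by more
   than tol, or -1 if the buffers agree. */
static long diff_buffers(const float *oracle, const float *candidate,
                         size_t n, float tol)
{
    for (size_t i = 0; i < n; i++) {
        float d = oracle[i] - candidate[i];
        if (d < 0.0f)
            d = -d;
        /* Negated comparison so that NaN is reported as a mismatch too. */
        if (!(d <= tol))
            return (long)i;
    }
    return -1;
}

/* A case passes only if the clean buffers agree AND a deliberately
   corrupted element is caught (assumed meaning of fault injection). */
static int case_passes(const float *oracle, float *candidate,
                       size_t n, float tol)
{
    assert(n > 0);
    if (diff_buffers(oracle, candidate, n, tol) != -1)
        return 0;                        /* genuine backend mismatch */

    size_t k = n / 2;
    float saved = candidate[k];
    candidate[k] = saved + 1.0f;         /* inject a fault */
    int caught = diff_buffers(oracle, candidate, n, tol) == (long)k;
    candidate[k] = saved;                /* restore the real result */
    return caught;
}
```

With this shape, a tolerance so loose that it hides the injected error makes the case fail, which is one way a green result can be made to mean something.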

Targets

AMD GPU. --amdgpu-bin for ELF code objects (.hsaco), --amdgpu for assembly. CDNA 2 (gfx90a, MI250), CDNA
3 (gfx942, MI300X), RDNA 2 (gfx1030), RDNA 3 (gfx1100), RDNA 4 (gfx1200).
NVIDIA PTX. --nvidia-ptx, defaults to sm_89. JIT-loaded via the CUDA Driver API.
Tenstorrent Metalium C++. --tensix, SFPU compute.
Tenstorrent baby cores. --rv-elf, native RV32IM ELF via the TDF layer. Integer kernels for now; a soft-float
runtime exists but is not yet wired in.
x86-64 host object. --cpu, links and runs on Linux.
RV64IMFD. --rv64, runs under qemu-riscv64.
Apple Metal MSL. --metal, stub backend, hardware validation pending.
Intel SPIR-V for Xe. --intel-spirv, stub backend.

Frontends

CUDA C. The same .cu files you'd feed nvcc. Real preprocessor (#include, function-like macros,
#ifdef/#if/#elif).
HIP. --hip or .hip files, CUDA's AMD sibling with the same kernel-language shape.
Triton. @triton.jit Python source through a from-scratch lex/parse/sema/lower. Rank-1 and rank-2 tiles,
tl.dot, K-loop tiling, the math intrinsics above.
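
For readers unfamiliar with the tl.dot plus K-loop pattern, the following C sketch shows the shape of the computation: each "program" owns one output tile and accumulates partial tile products while stepping along K. The tile sizes, function names, and the zero-fill handling of ragged edges are assumptions made for illustration; this is not what the Triton frontend emits.

```c
enum { BM = 4, BN = 4, BK = 2 };   /* hypothetical tile sizes */

/* Stand-in for tl.dot: acc[BM][BN] += a[BM][BK] * b[BK][BN]. */
static void tile_dot(float acc[BM][BN], float a[BM][BK], float b[BK][BN])
{
    for (int i = 0; i < BM; i++)
        for (int j = 0; j < BN; j++)
            for (int k = 0; k < BK; k++)
                acc[i][j] += a[i][k] * b[k][j];
}

/* One program: computes the output tile at (row0, col0) of C = A * B.
   A is M x K, B is K x N, C is M x N, all row-major. */
static void matmul_tile(const float *A, const float *B, float *C,
                        int M, int N, int K, int row0, int col0)
{
    float acc[BM][BN] = {{0}};
    for (int k0 = 0; k0 < K; k0 += BK) {        /* the K-loop */
        float a[BM][BK], b[BK][BN];
        /* Masked loads: out-of-range elements read as zero. */
        for (int i = 0; i < BM; i++)
            for (int k = 0; k < BK; k++)
                a[i][k] = (row0 + i < M && k0 + k < K)
                        ? A[(row0 + i) * K + (k0 + k)] : 0.0f;
        for (int k = 0; k < BK; k++)
            for (int j = 0; j < BN; j++)
                b[k][j] = (k0 + k < K && col0 + j < N)
                        ? B[(k0 + k) * N + (col0 + j)] : 0.0f;
        tile_dot(acc, a, b);
    }
    /* Masked store: only write elements that exist in C. */
    for (int i = 0; i < BM; i++)
        for (int j = 0; j < BN; j++)
            if (row0 + i < M && col0 + j < N)
                C[(row0 + i) * N + (col0 + j)] = acc[i][j];
}

/* The grid of programs, collapsed into host loops over output tiles. */
static void matmul(const float *A, const float *B, float *C,
                   int M, int N, int K)
{
    for (int r = 0; r < M; r += BM)
        for (int c = 0; c < N; c += BN)
            matmul_tile(A, B, C, M, N, K, r, c);
}
```

Because the inner tile product has fixed bounds, a compiler is free to unroll it fully, while the K-loop handles a contraction length that is only known at run time.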

Mainframe curios

ABEND dumps (src/runtime/bc_abend.*). GPU faults become IBM-style G0Cx completion codes, correlated against
tracked allocations with a dispatch snapshot. Fires automatically on the HSA path.
SNAP (--snap). Per-kernel parameter dump on entry. AMD only for now.
SYSPRINT. Class-tagged structured kernel output, pattern-routed sinks on the host. Demo:
examples/sysprint_kernel.cu + examples/launch_sysprint.c.
TDF (Tile DataFlow). The layer above BIR for dataflow targets: regions, channels, NoC arcs, L1 placement,
multi-core fission. Dump with --tdf.

Runtime + tooling

HSA runtime launcher (src/runtime/bc_runtime.h). Loads libhsa-runtime64.so at runtime via dlopen, so the
launcher itself has no compile-time ROCm dependency.
Bilingual errors (--lang <file>). Te reo Māori translation included; the format is data, so any language with
a translation file works.
Optimisation passes. mem2reg, constant folding, dead code elimination. Each one is skippable (--no-mem2reg /
--no-cfold / --no-dce) for bisection.
Differential testing harness (tests/diff/), as above.
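
The dlopen approach described for the HSA runtime launcher can be sketched as follows. This is a minimal illustration, not the contents of bc_runtime.h: the hsa_api struct and hsa_api_load function are hypothetical, and the sketch resolves only hsa_init and hsa_shut_down, which are declared locally so that no ROCm headers are needed at compile time.

```c
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

/* Both entry points take no arguments and return an hsa_status_t,
   treated here as an int where 0 means success. */
typedef int (*hsa_fn0)(void);

/* Hypothetical table of the entry points this sketch resolves. */
typedef struct {
    void   *lib;
    hsa_fn0 init;
    hsa_fn0 shut_down;
} hsa_api;

/* Returns 0 on success, -1 if the library or a symbol is missing.
   On failure the struct is left zeroed and nothing stays loaded. */
static int hsa_api_load(hsa_api *api, const char *soname)
{
    api->lib = NULL;
    api->init = NULL;
    api->shut_down = NULL;

    void *lib = dlopen(soname, RTLD_NOW | RTLD_LOCAL);
    if (!lib) {
        const char *why = dlerror();
        fprintf(stderr, "cannot load %s: %s\n", soname, why ? why : "unknown");
        return -1;
    }

    hsa_fn0 init = (hsa_fn0)dlsym(lib, "hsa_init");
    hsa_fn0 shut_down = (hsa_fn0)dlsym(lib, "hsa_shut_down");
    if (!init || !shut_down) {
        fprintf(stderr, "%s is missing expected HSA symbols\n", soname);
        dlclose(lib);
        return -1;
    }

    api->lib = lib;
    api->init = init;
    api->shut_down = shut_down;
    return 0;
}
```

A caller would pass "libhsa-runtime64.so" as soname and fall back to another backend when the load fails. On older glibc versions the program may need to be linked with -ldl.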

Validated on real silicon

AMD MI300X (CDNA 3, gfx942). 8/8 test kernels passing. Moa Monte Carlo neutron transport produces correct
physics (k_eff = 0.995 vs reference 1.000).
AMD RDNA 3 (gfx1100). Full test suite via the tinygrad mockgpu emulator in CI.
NVIDIA RTX 4060 Ti. Moa transport benchmark produces correct results with a 3.8x speedup over single-thread CPU.
No nvcc anywhere in the pipeline.
Tenstorrent Blackhole. Compiles to valid Metalium C++.

These notes are a summary rather than a full changelog, since this is the first tagged release. See CHANGELOG.txt for the full prose.

Thanks to everyone who has contributed, especially @nataliakokoromyti and @shivam2931120, and to everyone else who has sent in tips and tricks or raised issues.

Releases: Zaneham/BarraCUDA
