Releases: Zaneham/BarraCUDA

BarraCUDA 0.5

29 May 04:30
cdbbf67

The first tagged release. The headline is that you can write a Triton kernel, matmul and all, and run it on a CPU with
no GPU. The --cpu backend lowers BIR straight to x86-64 with the SIMT model collapsed into a thread loop, and the
rank-2 tile path materialises and unrolls so tl.dot plus a K-loop sweeps an arbitrary contraction.
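
To make "the SIMT model collapsed into a thread loop" concrete, here is a minimal pure-Python sketch of the idea. It is not BarraCUDA's code generator: saxpy_kernel and launch_on_cpu are hypothetical names invented for illustration, and the real backend emits x86-64 from BIR rather than interpreting anything.

```python
# Illustrative only: a SIMT kernel body written per thread, then "launched"
# on the host by visiting every (block, thread) pair sequentially.

def saxpy_kernel(block_idx, thread_idx, block_dim, n, a, x, y, out):
    """Per-thread body, analogous to a CUDA __global__ function."""
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < n:  # bounds guard, as in a real kernel
        out[i] = a * x[i] + y[i]


def launch_on_cpu(kernel, grid_dim, block_dim, *args):
    """Hypothetical launcher: run each logical thread in turn on one core."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


n = 10
x = [float(i) for i in range(n)]
y = [1.0] * n
out = [0.0] * n
block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # ceil-divide so every element is covered
launch_on_cpu(saxpy_kernel, grid_dim, block_dim, n, 2.0, x, y, out)
```

A plain loop like this is only valid for kernels without block-level barriers; a kernel that synchronises threads needs the loop split at each barrier. How BarraCUDA handles that case is not covered in these notes.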

New in this cycle

  • CPU backend (--cpu). CUDA and Triton kernels compile to a host object and run natively. Headline demo:
    examples/cpu_launch_matmul.c.
  • RISC-V backend (--rv64). Same idea, RV64IMFD objects that run under qemu.
  • Cross-backend differential testing (tests/diff/). Same BIR through two backends, diff the output buffers, CPU
    is the oracle. Every case runs --inject so a green result actually means something.
  • Triton scalar math intrinsics. exp, log, sin, cos, tan, tanh, sqrt, rsqrt, abs, floor,
    ceil, maximum, minimum, fdiv. Thanks to @shivam2931120 for the PR, radians-to-turns convention done right.
  • Triton constexpr ABI compaction. tl.constexpr params with defaults fold to literals and drop out of the
    runtime signature.
  • CUDA fixes. --cpu / --rv64 on their own now run sema and don't trip the parse-dump fallback. Typedef-struct
    kernels compile through --cpu, and --parse no longer segfaults on synthetic anon names.
  • --version flag.
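
The constexpr ABI compaction item can be pictured with a small signature-splitting sketch. Everything here is an assumption made for illustration: Constexpr is a stand-in for the tl.constexpr annotation, and compact_signature is a hypothetical helper, not part of BarraCUDA or Triton.

```python
import inspect


class Constexpr:
    """Hypothetical stand-in for the tl.constexpr annotation."""


def compact_signature(fn):
    """Split parameters into runtime arguments and folded compile-time literals."""
    runtime, folded = [], {}
    for name, param in inspect.signature(fn).parameters.items():
        is_constexpr = param.annotation is Constexpr
        has_default = param.default is not inspect.Parameter.empty
        if is_constexpr and has_default:
            folded[name] = param.default  # becomes a literal in the kernel body
        else:
            runtime.append(name)  # stays in the runtime signature
    return runtime, folded


def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: Constexpr = 128):
    pass  # body is irrelevant here; only the signature is inspected


runtime, folded = compact_signature(add_kernel)
```

In this sketch the launcher would pass only x_ptr, y_ptr, out_ptr and n, while BLOCK is baked in as 128. The notes only describe constexpr params that have defaults, so the sketch simply leaves a constexpr param without a default in the runtime list.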

Targets

  • AMD GPU. --amdgpu-bin for ELF code objects (.hsaco), --amdgpu for assembly. CDNA 2 (gfx90a, MI250), CDNA
    3 (gfx942, MI300X), RDNA 2 (gfx1030), RDNA 3 (gfx1100), RDNA 4 (gfx1200).

  • NVIDIA PTX. --nvidia-ptx, defaults to sm_89. JIT-loaded via the CUDA Driver API.

  • Tenstorrent Metalium C++. --tensix, SFPU compute.

  • Tenstorrent baby cores. --rv-elf, native RV32IM ELF via the TDF layer. Integer kernels for now; a soft-float
    runtime exists but is not yet wired in.

  • x86-64 host object. --cpu, links and runs on Linux.

  • RV64IMFD. --rv64, runs under qemu-riscv64.

  • Apple Metal MSL. --metal, stub backend, hardware validation pending.

  • Intel SPIR-V for Xe. --intel-spirv, stub backend.

Frontends

  • CUDA C. The same .cu files you'd feed nvcc. Real preprocessor (#include, function-like macros,
    #ifdef/#if/#elif).

  • HIP. --hip or .hip files, CUDA's AMD sibling with the same kernel-language shape.

  • Triton. @triton.jit Python source through a from-scratch lex/parse/sema/lower. Rank-1 and rank-2 tiles,
    tl.dot, K-loop tiling, the math intrinsics above.
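
The tl.dot plus K-loop pattern mentioned above can be sketched in pure Python. This is a conceptual model of tiled accumulation, not Triton code and not BarraCUDA's lowering: tile_dot and tiled_matmul are hypothetical names, and the tile sizes are arbitrary.

```python
def tile_dot(a_tile, b_tile):
    """Dense product of one (BM x BK) tile with one (BK x BN) tile."""
    bm, bk, bn = len(a_tile), len(b_tile), len(b_tile[0])
    return [[sum(a_tile[i][k] * b_tile[k][j] for k in range(bk))
             for j in range(bn)] for i in range(bm)]


def tiled_matmul(a, b, bm=2, bn=2, bk=2):
    """C = A @ B computed one output tile at a time, accumulating over a K-loop."""
    m, k_dim, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, bm):  # each (i0, j0) pair plays the role of one program
        for j0 in range(0, n, bn):
            rows = range(i0, min(i0 + bm, m))  # clipped at the matrix edge
            cols = range(j0, min(j0 + bn, n))
            acc = [[0.0] * len(cols) for _ in rows]
            for k0 in range(0, k_dim, bk):  # the K-loop over the contraction
                ks = range(k0, min(k0 + bk, k_dim))
                a_tile = [[a[i][k] for k in ks] for i in rows]
                b_tile = [[b[k][j] for j in cols] for k in ks]
                part = tile_dot(a_tile, b_tile)
                for ii in range(len(rows)):
                    for jj in range(len(cols)):
                        acc[ii][jj] += part[ii][jj]
            for ii, i in enumerate(rows):  # write the finished tile back
                for jj, j in enumerate(cols):
                    c[i][j] = acc[ii][jj]
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[1, 0], [0, 1], [1, 1]]
c = tiled_matmul(a, b)
```

The 3x3 by 3x2 example is chosen so that no dimension is a multiple of the tile size, which exercises the edge clipping that a real kernel would express with masks.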

Mainframe curios

  • ABEND dumps (src/runtime/bc_abend.*). GPU faults become IBM-style G0Cx completion codes, correlated against
    tracked allocations with a dispatch snapshot. Fires automatically on the HSA path.

  • SNAP (--snap). Per-kernel parameter dump on entry. AMD only for now.

  • SYSPRINT. Class-tagged structured kernel output, pattern-routed sinks on the host. Demo:
    examples/sysprint_kernel.cu + examples/launch_sysprint.c.

  • TDF (Tile DataFlow). The layer above BIR for dataflow targets: regions, channels, NoC arcs, L1 placement,
    multi-core fission. Dump with --tdf.

Runtime + tooling

  • HSA runtime launcher (src/runtime/bc_runtime.h). Loads libhsa-runtime64.so at runtime via dlopen, so the
    launcher itself has no compile-time ROCm dependency.

  • Bilingual errors (--lang <file>). Te reo Māori translation included; the format is data, so any language with
    a translation file works.

  • Optimisation passes. mem2reg, constant folding, dead code elimination. Each one is skippable (--no-mem2reg /
    --no-cfold / --no-dce) for bisection.

  • Differential testing harness (tests/diff/), as above.
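
The differential testing idea (same input through two backends, diff the output buffers, oracle decides) can be sketched as below. This is an assumption-laden illustration: the two backend functions are stand-ins for real compiled kernels, and the sketch assumes that --inject means planting a deliberate fault so the harness proves it can detect a mismatch. The notes do not spell out what --inject does.

```python
import math


def backend_reference(xs):
    """Stand-in for the oracle backend (the CPU backend in the notes)."""
    return [x * x + 1.0 for x in xs]


def backend_under_test(xs, inject=False):
    """Stand-in for a second backend; inject corrupts one element on purpose."""
    out = [x * x + 1.0 for x in xs]
    if inject and out:
        out[len(out) // 2] += 1.0  # planted fault the harness must catch
    return out


def diff_buffers(expected, actual, rel_tol=1e-6, abs_tol=1e-9):
    """Return the indices where the two output buffers disagree."""
    if len(expected) != len(actual):
        raise ValueError("buffer lengths differ")
    return [i for i, (e, a) in enumerate(zip(expected, actual))
            if not math.isclose(e, a, rel_tol=rel_tol, abs_tol=abs_tol)]


def run_case(xs):
    """Pass only if the clean run matches AND the injected run is caught."""
    oracle = backend_reference(xs)
    clean = diff_buffers(oracle, backend_under_test(xs))
    injected = diff_buffers(oracle, backend_under_test(xs, inject=True))
    return not clean and bool(injected)
```

Requiring the injected run to fail is what makes a green result meaningful: a harness that compared nothing, or compared the wrong buffers, would pass the clean run but also "pass" the corrupted one.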

Validated on real silicon

  • AMD MI300X (CDNA 3, gfx942). 8/8 test kernels passing. Moa Monte Carlo neutron transport produces correct
    physics (k_eff = 0.995 vs reference 1.000).

  • AMD RDNA 3 (gfx1100). Full test suite via the tinygrad mockgpu emulator in CI.

  • NVIDIA RTX 4060 Ti. Moa transport benchmark produces correct results with a 3.8x speedup over single-thread CPU.
    No NVCC anywhere in the pipeline.

  • Tenstorrent Blackhole. Compiles to valid Metalium C++.

This is not a full changelog, since this is the first tagged release. See CHANGELOG.txt for the full prose.

Thanks to everyone who has contributed, especially @nataliakokoromyti and @shivam2931120, and to everyone who sent in tips and tricks or raised issues.