Algebraic enhancements for GEMM & AI accelerators
Updated Feb 28, 2025 · Python
Minimal TPU implementation with 8x8 systolic array and PyTorch integration
Open-source AI Accelerator Stack integrating compute, memory, and software — from RTL to PyTorch.
SystemVerilog Implementations of CUDA/TensorCore/TPU GEMM Operations
Modular systolic array with software interface
Hardware accelerator for 2D convolution using an 8×8 weight-stationary systolic array with split-kernel support, dual-port SRAM architecture, and DMA-based streaming
Parameterized N×N output-stationary systolic array accelerator for INT8 neural network inference. Full RTL-to-GDS flow on ASAP7 7nm using Cadence Genus + Innovus. 667 MHz, 42.7 GOPS peak throughput, 0.33 mW/GOPS. SystemVerilog RTL, synthesis, place-and-route, and self-checking testbench included.
A high-performance INT8 Matrix Multiplication Accelerator implemented in pure Verilog, optimized for Edge AI inference on Xilinx Kria KV260 (Zynq UltraScale+ MPSoC).
High-performance systolic array computing framework with AI agents and medical compliance.
Deadline-constrained SCALE-Sim and Accelergy experiment for image-processing systolic arrays
INT8 Systolic-Array AI Accelerator on Zynq SoC with HW-SW Co-Design and Roofline Performance Analysis
Three Logisim implementations of matrix multiplication — Standard, Systolic TPU, and Tiling. Built from scratch in digital logic.
4×4 7-bit matrix multiplication hardware accelerator using a systolic array, with a Python driver for the Basys 3 FPGA and a systolic array UVC using UVM.
An AXI-native 8x8 systolic array accelerator in Verilog. Features pure dataflow pipelining, Q-format fixed-point arithmetic, and hardware validation on the Kria KV260 FPGA.
Small-scale FPGA-based Neural Processing Unit (CNN Accelerator) with INT8 systolic array matrix multiplication in Verilog.
𓁰 PtahCore — open-source FP8 tensor accelerator, RTL→GDSII on open 7nm ASAP7, 100% open toolchain
A high-performance, production-grade 64-bit RISC-V Multicore SoC ecosystem and industry-standard Cadence ASIC CAD flow (Genus/Innovus). Fully integrated 5-hart coherent core complex, TileLink interconnect, custom RoCC ML Systolic Array, PCIe, USB, HDMI, and silicon-proven IP blocks.
An 8×8 systolic array AI accelerator implemented in SystemVerilog on Zynq UltraScale+ ZCU104, achieving 1.7 GOPS at 6 mW PL logic power (~283 GOPS/W efficiency) with full AXI-Stream PS-PL integration. Targets INT8 matrix multiplication for transformer inference acceleration, verified across behavioral, post-synthesis and implementation simulation.
Reproducibility artifact for IEEE TC paper TC-2025-09-0830 'Characterizing and Accelerating Spacecraft Onboard Workloads on RISC-V Platform': 28 synthetic spacecraft workloads, gem5 with NOVA RVTrig+RVMatrix ISA extensions, FPGA RTL (VU9P), McPAT models, and RISC-V toolchain patches.