Posts
Nov, 23
AIvailable: A Software-Defined Architecture for LLM-as-a-Service on Heterogeneous and Legacy GPUs
The rise of Large Language Models (LLM) has increased the need for scalable, high-performance inference systems, yet most existing frameworks assume homogeneous, resource-rich hardware, often unrealistic in academic, or resource-constrained settings. We introduce AIvailable, a low-cost, highly available LLM-as-a-Service (LLMaaS) platform, that uses a software-defined approach for running LLMs across heterogeneous and legacy GPU nodes, […]
Nov, 23
ProofWright: Towards Agentic Formal Verification of CUDA
Large Language Models (LLMs) are increasingly used to automatically generate optimized CUDA kernels, substantially improving developer productivity. However, despite rapid generation, these kernels often contain subtle correctness bugs and lack formal safety guarantees. Runtime testing is inherently unreliable – limited input coverage and reward hacking can mask incorrect behavior – while manual formal verification is […]
Nov, 23
Iris: First-Class Multi-GPU Programming Experience in Triton
Multi-GPU programming traditionally requires developers to navigate complex trade-offs between performance and programmability. High-performance implementations typically rely on low-level HIP/CUDA communication libraries that demand substantial engineering effort for even basic overlap patterns, while simpler abstractions often sacrifice performance. We present Iris, a multi-GPU communication library implemented entirely in Python and Triton that eliminates this trade-off. […]
Nov, 23
The Anatomy of a Triton Attention Kernel
A long-standing goal in both industry and academia is to develop an LLM inference platform that is portable across hardware architectures, eliminates the need for low-level hand-tuning, and still delivers best-in-class efficiency. In this work, we demonstrate that portable, efficient cross-platform LLM inference is indeed possible and share our experience. We develop a state-of-the-art paged […]
Nov, 23
Inside VOLT: Designing an Open-Source GPU Compiler
Recent efforts in open-source GPU research are opening new avenues in a domain that has long been tightly coupled with a few commercial vendors. Emerging open GPU architectures define SIMT functionality through their own ISAs, but executing existing GPU programs and optimizing performance on these ISAs relies on a compiler framework that is technically complex […]
Nov, 16
PRAGMA: A Profiling-Reasoned Multi-Agent Framework for Automatic Kernel Optimization
Designing high-performance kernels requires expert-level tuning and a deep understanding of hardware characteristics. Recent advances in large language models (LLMs) have enabled automated kernel generation, yet most existing systems rely solely on correctness or execution time feedback, lacking the ability to reason about low-level performance bottlenecks. In this paper, we introduce PRAGMA, a profile-guided AI […]
Nov, 16
HipKittens: Fast and Furious AMD Kernels
AMD GPUs offer state-of-the-art compute and memory bandwidth; however, peak performance AMD kernels are written in raw assembly. To address the difficulty of mapping AI algorithms to hardware, recent work proposes C++ embedded and PyTorch-inspired domain-specific languages like ThunderKittens (TK) to simplify high performance AI kernel development on NVIDIA hardware. We explore the extent to […]
Nov, 16
An MLIR pipeline for offloading Fortran to FPGAs via OpenMP
With the slowing of Moore’s Law, heterogeneous computing platforms such as Field Programmable Gate Arrays (FPGAs) have gained increasing interest for accelerating HPC workloads. In this work we present, to the best of our knowledge, the first implementation of selective code offloading to FPGAs via the OpenMP target directive within MLIR. Our approach combines the […]
Nov, 16
MT4G: A Tool for Reliable Auto-Discovery of NVIDIA and AMD GPU Compute and Memory Topologies
Understanding GPU topology is essential for performance-related tasks in HPC or AI. Yet, unlike for CPUs with tools like hwloc, GPU information is hard to come by, incomplete, and vendor-specific. In this work, we address this gap and present MT4G, an open-source and vendor-agnostic tool that automatically discovers GPU compute and memory topologies and configurations, […]
Nov, 16
A High-Throughput GPU Framework for Adaptive Lossless Compression of Floating-Point Data
The torrential influx of floating-point data from domains like IoT and HPC necessitates high-performance lossless compression to mitigate storage costs while preserving absolute data fidelity. Leveraging GPU parallelism for this task presents significant challenges, including bottlenecks in heterogeneous data movement, complexities in executing precision-preserving conversions, and performance degradation due to anomaly-induced sparsity. To address these […]
Nov, 9
CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization
Developing efficient CUDA kernels is increasingly critical for AI applications such as large-scale LLM training. However, manual kernel design is both costly and time-consuming, motivating automatic approaches that leverage LLMs for code generation. Existing methods for automatic kernel generation, however, often produce low-efficiency kernels, incur high computational overhead, and fail to generalize across settings. In […]
Nov, 9
RDMA Point-to-Point Communication for LLM Systems
Emerging Large Language Model (LLM) system patterns, such as disaggregated inference, Mixture-of-Experts (MoE) routing, and asynchronous reinforcement fine-tuning, require flexible point-to-point communication beyond simple collectives. Existing implementations are locked to specific Network Interface Controllers (NICs), hindering integration into inference engines and portability across hardware providers. We present TransferEngine, which bridges the functionality of common NICs […]

