high performance computing on graphics processing units: hgpu.org

Posts

May, 20

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

LLM-based Triton kernel generation has attracted significant interest, yet a fundamental empirical question remains unanswered: where does this capability break down, and why? We present KernelBenchX, a benchmark designed to answer this question through category-aware evaluation of correctness and hardware efficiency across 176 tasks in 15 categories. Our systematic comparison of five representative methods yields […]

May, 20

Pretraining large language models with MXFP4 on Native FP4 Hardware

Why does full-pipeline FP4 training of large language models often diverge, even when forward activations and activation gradients remain stable? We address this question through a controlled study of MXFP4 quantization in transformer training, progressively enabling FP4 across forward propagation (Fprop), activation gradients (Dgrad), and weight gradients (Wgrad) while holding all other factors fixed. In […]

May, 20

CUDAHercules: Benchmarking Hardware-Aware Expert-level CUDA Optimization for LLMs

Large language models show promise for automated CUDA programming; however, even the strongest coding models (e.g., Claude-Opus-4.6) may still fall short of expert-level, architecture-aware optimization. We introduce CUDAHercules, a benchmark that evaluates generated CUDA against end-to-end human-expert SOTA systems. It spans single kernels, module-level operators, full applications, and unsolved challenge tasks across Ampere, Hopper, and […]

CUDA

May, 20

Source-to-Source Transformations for GPU Code Generation

GPUs have become essential in modern high performance computing, but programming them correctly remains a significant challenge. This difficulty arises from subtle concurrency bugs that result from the explicit management of synchronization primitives and data movement across intricate hierarchies of memory and parallel threads. At the same time, the ability to control these aspects explicitly […]

CUDA

May, 20

CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging

Debugging CUDA programs has long been challenging because failures often arise from subtle interactions among hardware behavior, compiler decisions, memory hierarchy, and asynchronous execution. More importantly, with the rapid expansion of GPU usage across scientific computing, machine learning, graphics, and systems workloads, CUDA debugging has become more challenging than ever. Current evaluations of LLM-based CUDA […]

CUDA

May, 20

Analyzing the Impact of Kernel Fusion on GPU Tensor Operation Performance: A Systematic Performance Study

Modern deep learning frameworks execute large numbers of small tensor kernels on GPUs, where overall performance is frequently constrained by memory bandwidth and kernel launch overheads. Systems such as TensorFlow XLA, PyTorch JIT, and cuDNN often use kernel fusion, which combines many tensor operations into a single GPU kernel, to […]

CUDA

May, 11

Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

Iterative GPU kernel tuning is bottlenecked by the scale of the applications that host the kernels. Rapid iteration requires isolating the kernel so it can be edited, recompiled, and validated without rebuilding the full application — but manual isolation requires reconstructing build flags, dispatch configuration, and runtime inputs by hand, so developers usually settle for […]

May, 11

KEET: Explaining Performance of GPU Kernels Using LLM Agents

Performance profiles of GPU kernels generated by tools such as Nsight Compute are rich in detail but are often challenging to interpret. To achieve the best performance possible on a given GPU architecture, kernel developers need to spend significant time analyzing and comparing profiles in the tool’s graphical interface to identify and understand kernel performance […]

CUDA

May, 11

CuBridge: An LLM-Based Framework for Understanding and Reconstructing High-Performance Attention Kernels

Efficient CUDA implementations of attention mechanisms are critical to modern deep learning systems, yet supporting diverse and evolving attention variants remains challenging. Existing frameworks and compilers trade performance for flexibility, while expert-written kernels achieve high efficiency but are difficult to adapt. Recent work explores large language models (LLMs) for GPU kernel generation, but prior studies […]

CUDA

May, 11

Microbenchmark-Driven Analytical Performance Modeling Across Modern GPU Architectures

Rapidly evolving GPU architectures featuring complex memory hierarchies, matrix units, and varied precision formats continue to widen the gap between theoretical peaks and achievable performance. We design and develop analytical performance models for NVIDIA Blackwell (B200) and AMD CDNA3 (MI300A) grounded in systematic microbenchmark characterization. For Blackwell, the model captures Tensor Memory (TMEM), asynchronous bulk […]

CUDA

May, 11

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

The scaling of large language models (LLMs) is currently bottlenecked by the rigidity of distributed programming. While high-performance libraries like cuBLAS and NCCL provide optimized primitives, they lack the flexibility required for rapidly evolving model architectures. Conversely, existing tensor compilers fail to address the complex memory hierarchy of distributed clusters effectively. To bridge this gap, […]

May, 3

ARGUS: Agentic GPU Optimization Guided by Data-Flow Invariants

LLM-based coding agents can generate functionally correct GPU kernels, yet their performance remains far below hand-optimized libraries on critical computations such as matrix multiplication, attention, and Mixture-of-Experts (MoE). Peak GPU performance requires coordinated reasoning over tightly coupled optimizations, including tiling, shared-memory staging, software pipelining, and instruction scheduling, while existing agents rely on sparse pass/fail feedback, […]

Recent source codes

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems

CuTile Benchmark Suite: Performance and Productivity Tradeoffs for GPU Kernel Programming on Blackwell Architecture

Agentic Code Optimization via Compiler-LLM Cooperation

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Device Virtual Machine (DVM)

AutoKernel: Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels

Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context

Most viewed papers (last 30 days)