high performance computing on graphics processing units: hgpu.org

Posts

Jul, 20

Kevin: Multi-Turn RL for Generating CUDA Kernels

Writing GPU kernels is a challenging task and critical for AI systems’ efficiency. It is also highly iterative: domain experts write code and improve performance through execution feedback. Moreover, it presents verifiable rewards like correctness and speedup, making it a natural environment to apply Reinforcement Learning (RL). To explicitly incorporate the iterative nature of this […]

CUDA

Jul, 20

Using Deep Reinforcement Learning for Automatic Code Optimization in the MLIR Compiler

This work focuses on the use of deep reinforcement learning (DRL) to automate code optimization within modern compiler infrastructures. Code optimization is a critical step in program transformation that aims to improve performance and reduce resource consumption while preserving correctness. Traditional approaches to code optimization rely on manual or heuristic-based methods, which are often time-consuming […]

Jul, 20

Specx: a C++ task-based runtime system for heterogeneous distributed architectures

Parallelization is needed everywhere, from laptops and mobile phones to supercomputers. Among parallel programming models, task-based programming has demonstrated a powerful potential and is widely used in high-performance scientific computing. Not only does it allow efficient parallelization across distributed heterogeneous computing nodes, but it also allows for elegant source code structuring by describing hardware-independent algorithms. […]

CUDA

Jul, 20

Pre-Training LLMs on a budget: A comparison of three optimizers

Optimizers play a decisive role in reducing pre-training times for LLMs and achieving better-performing models. In this study, we compare three major variants: the de-facto standard AdamW, the simpler Lion, developed through an evolutionary search, and the second-order optimizer Sophia. For better generalization, we train with two different base architectures and use a single- and […]

CUDA

Jul, 20

Dissecting the NVIDIA Blackwell Architecture with Microbenchmarks

The rapid development in scientific research provides a need for more compute power, which is partly being solved by GPUs. This paper presents a microarchitectural analysis of the modern NVIDIA Blackwell architecture by studying GPU performance features with thought through microbenchmarks. We unveil key subsystems, including the memory hierarchy, SM execution pipelines, and the SM […]

CUDA

Jul, 13

Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

The rise of GPU-based high-performance computing (HPC) has driven the widespread adoption of parallel programming models such as CUDA. Yet, the inherent complexity of parallel programming creates a demand for the automated sequential-to-parallel approaches. However, data scarcity poses a significant challenge for machine learning-based sequential-to-parallel code translation. Although recent back-translation methods show promise, they still […]

CUDA

Jul, 13

KIS-S: A GPU-Aware Kubernetes Inference Simulator with RL-Based Auto-Scaling

Autoscaling GPU inference workloads in Kubernetes remains challenging due to the reactive and threshold-based nature of default mechanisms such as the Horizontal Pod Autoscaler (HPA), which struggle under dynamic and bursty traffic patterns and lack integration with GPU-level metrics. We present KIS-S, a unified framework that combines KISim, a GPU-aware Kubernetes Inference Simulator, with KIScaler, […]

Jul, 13

Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms

The NVIDIA Collective Communication Library (NCCL) is a critical software layer enabling high-performance collectives on large-scale GPU clusters. Despite being open source with a documented API, its internal design remains largely opaque. The orchestration of communication channels, selection of protocols, and handling of memory movement across devices and nodes are not well understood, making it […]

CUDA

Jul, 13

Serving LLMs in HPC Clusters: A Comparative Study of Qualcomm Cloud AI 100 Ultra and High-Performance GPUs

This study presents a benchmarking analysis of the Qualcomm Cloud AI 100 Ultra (QAic) accelerator for large language model (LLM) inference, evaluating its energy efficiency (throughput per watt) and performance against leading NVIDIA (A100, H200) and AMD (MI300A) GPUs within the National Research Platform (NRP) ecosystem. A total of 15 open-source LLMs, ranging from 117 […]

Jul, 13

Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems

As GPU-using tasks become more common in embedded, safety-critical systems, efficiency demands necessitate sharing a single GPU among multiple tasks. Unfortunately, existing ways to schedule multiple tasks onto a GPU often either result in a loss of ability to meet deadlines, or a loss of efficiency. In this work, we develop a system-level spatial compute […]

CUDA

Jul, 6

Efficient GPU Implementation of Multi-Precision Integer Division

Efficient arithmetic on multi-precision integers is a cornerstone of many scientific and cryptographic applications that require computations on integers that exceed the native sizes supported by modern processors. While GPU-efficient addition and multiplication has been well explored, division has been subject to less attention due to its greater algorithmic complexity. This thesis attempts to bridge […]

CUDA

Jul, 6

P4OMP: Retrieval-Augmented Prompting for OpenMP Parallelism in Serial Code

We present P4OMP, a retrieval-augmented framework for transforming serial C/C++ code into OpenMP-annotated parallel code using large language models (LLMs). To our knowledge, this is the first system to apply retrieval-based prompting for OpenMP pragma correctness without model fine-tuning or compiler instrumentation. P4OMP leverages Retrieval-Augmented Generation (RAG) with structured instructional knowledge from OpenMP tutorials to […]

* * *

high performance computing on graphics processing units: hgpu.org

Posts

Kevin: Multi-Turn RL for Generating CUDA Kernels

Using Deep Reinforcement Learning for Automatic Code Optimization in the MLIR Compiler

Specx: a C++ task-based runtime system for heterogeneous distributed architectures

Pre-Training LLMs on a budget: A comparison of three optimizers

Dissecting the NVIDIA Blackwell Architecture with Microbenchmarks

Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

KIS-S: A GPU-Aware Kubernetes Inference Simulator with RL-Based Auto-Scaling

Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms

Serving LLMs in HPC Clusters: A Comparative Study of Qualcomm Cloud AI 100 Ultra and High-Performance GPUs

Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems

Efficient GPU Implementation of Multi-Precision Integer Division

P4OMP: Retrieval-Augmented Prompting for OpenMP Parallelism in Serial Code

Recent source codes

Specx: Speculative task-based runtime system

Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

KISim: Kubernetes Intelligent Scheduling Simulator

Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems

Efficient GPU Implementation of Multi-Precision Integer Division

ParEval: A Parallel Code Evaluation Benchmark

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

exa-AMD: Exascale Accelerated Materials Discovery

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

Most viewed papers (last 30 days)