Posts
Sep, 14
Towards Calculating HPC CUDA Kernel Performance on Nvidia GPUs
This thesis aims to provide the groundwork for a performance estimation model for CUDA kernels based on cycle counting. After a short overview of past GPU performance modeling techniques, it conducts an exhaustive, in-depth analysis of Nvidia’s SASS instruction set and CUDA ELF formats for architectures from Maxwell up to and including Blackwell, […]
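As a hedged sketch of how such an instruction-level analysis might begin in practice, the snippet below dumps the SASS listing of a compiled binary with the CUDA toolkit's `cuobjdump` and tallies opcode mnemonics; the `kernel.cubin` path is a placeholder, and the parsing heuristic is illustrative rather than the thesis's actual tooling:

```python
# Dump the SASS disassembly of a compiled CUDA binary and count opcode mnemonics.
# Assumes the CUDA toolkit is installed; "kernel.cubin" is a placeholder path.
import subprocess

def dump_sass(cubin_path: str) -> str:
    """Return the SASS listing produced by cuobjdump for a compiled binary."""
    result = subprocess.run(
        ["cuobjdump", "-sass", cubin_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    sass = dump_sass("kernel.cubin")
    opcodes = {}
    for line in sass.splitlines():
        line = line.strip()
        # Instruction lines begin with an address comment like /*0008*/.
        if line.startswith("/*") and "*/" in line:
            body = line.split("*/", 1)[1].strip()
            if body:
                op = body.split()[0].rstrip(";")
                opcodes[op] = opcodes.get(op, 0) + 1
    for op, count in sorted(opcodes.items(), key=lambda kv: -kv[1]):
        print(f"{op:16s} {count}")
```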
Sep, 14
An HPC Benchmark Survey and Taxonomy for Characterization
The field of High-Performance Computing (HPC) is defined by providing computing devices with the highest performance to a variety of demanding scientific users. The tight co-design relationship between HPC providers and users, paired with technological improvements, propels the field forward, achieving continuously higher performance and resource utilization. A key device for system architects, architecture researchers, and […]
Sep, 14
Home-made Diffusion Model from Scratch to Hatch
We introduce the Home-made Diffusion Model (HDM), an efficient yet powerful text-to-image diffusion model optimized for training (and inference) on consumer-grade hardware. HDM achieves competitive 1024×1024 generation quality while maintaining a remarkably low training cost of $535-620 using four RTX 5090 GPUs, representing a significant reduction in computational requirements compared to traditional approaches. Our key contributions include: […]
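For background on the mechanics, here is a minimal NumPy sketch of the standard DDPM-style forward noising step; it is generic diffusion background with an illustrative schedule, not HDM's specific formulation:

```python
# Forward (noising) step of a standard DDPM-style diffusion process.
# Generic background with an illustrative schedule, not HDM's formulation.
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)      # cumulative product of (1 - beta_t)

def q_sample(x0: np.ndarray, t: int, rng=np.random.default_rng()) -> np.ndarray:
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(a_bar_t) x_0, (1 - a_bar_t) I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

x0 = np.zeros((3, 64, 64))                # stand-in for a normalized image tensor
x_t = q_sample(x0, t=500)
```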
Sep, 14
High Performance Matrix Multiplication
Matrix multiplication is the foundation of much of the success of high-performance technologies such as deep learning, scientific simulation, and video graphics. High-level programming languages like Python and R rely on highly optimized low-level libraries, such as the Basic Linear Algebra Subprograms (BLAS), to perform core linear algebra operations like matrix multiplication. This paper compares […]
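To make the dispatch concrete, the snippet below multiplies two matrices in Python both implicitly through NumPy and explicitly through SciPy's BLAS bindings; which BLAS implementation (OpenBLAS, MKL, BLIS, ...) actually runs depends on how NumPy and SciPy were built:

```python
# Matrix multiplication in Python ultimately dispatches to a BLAS GEMM routine.
# Which BLAS backs the call depends on the NumPy/SciPy build.
import numpy as np
from scipy.linalg.blas import dgemm

rng = np.random.default_rng(0)
A = rng.standard_normal((512, 256))
B = rng.standard_normal((256, 128))

C_numpy = A @ B                      # implicit: NumPy routes this through BLAS
C_blas = dgemm(alpha=1.0, a=A, b=B)  # explicit: call double-precision GEMM directly

assert np.allclose(C_numpy, C_blas)
np.show_config()                     # print which BLAS library NumPy was linked against
```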
Sep, 14
Combining Performance and Productivity: Accelerating the Network Sensing Graph Challenge with GPUs and Commodity Data Science Software
The HPEC Graph Challenge is a collection of benchmarks representing complex workloads that test the hardware and software components of HPC systems, which traditional benchmarks, such as LINPACK, do not. The first benchmark, Subgraph Isomorphism, focused on several compute-bound and memory-bound kernels. The most recent of the challenges, the Anonymized Network Sensing Graph Challenge, represents […]
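For context, the Subgraph Isomorphism benchmark is often reduced to triangle counting expressed as sparse matrix products; a minimal CPU-side SciPy version of that kernel (not the GPU data-science stack the paper targets) looks like this:

```python
# Triangle counting via sparse matrix products, the classic Graph Challenge kernel.
# CPU SciPy version for illustration; the paper targets GPU data-science libraries.
import numpy as np
import scipy.sparse as sp

def count_triangles(adj: sp.csr_matrix) -> int:
    """adj must be a symmetric 0/1 adjacency matrix with an empty diagonal."""
    paths = adj @ adj                 # paths[i, j] = number of length-2 paths i -> j
    closed = adj.multiply(paths)      # keep only paths closed by an edge (i, j)
    return int(closed.sum()) // 6     # each triangle is counted 6 times

# Tiny example: a 4-cycle plus one chord contains exactly 2 triangles.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
rows = [u for u, v in edges] + [v for u, v in edges]
cols = [v for u, v in edges] + [u for u, v in edges]
adj = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(4, 4))
print(count_triangles(adj))           # -> 2
```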
Sep, 7
CrossTL: A Universal Programming Language Translator with Unified Intermediate Representation
We present CrossTL, a universal programming language translator enabling bidirectional translation between multiple languages through a unified intermediate representation called CrossGL. Traditional approaches require separate translators for each language pair, leading to exponential complexity growth. CrossTL uses a single universal IR to facilitate translations between CUDA, HIP, Metal, DirectX HLSL, OpenGL GLSL, Vulkan SPIR-V, Rust, […]
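The pipeline shape described here, many frontends and backends meeting at a single IR, can be sketched with a toy expression IR and two per-target emitters; the node and function names below are invented for illustration and are not CrossGL's actual API:

```python
# Toy illustration of one shared IR with multiple backends (not CrossGL's API).
from dataclasses import dataclass
from typing import Union

Expr = Union["BinOp", str]

@dataclass
class BinOp:
    op: str      # e.g. "+", "*"
    lhs: Expr
    rhs: Expr

def emit_expr(node: Expr) -> str:
    """Target-neutral expression rendering shared by every backend."""
    if isinstance(node, str):
        return node
    return f"({emit_expr(node.lhs)} {node.op} {emit_expr(node.rhs)})"

def emit_hlsl_assign(name: str, node: Expr) -> str:
    return f"float {name} = {emit_expr(node)};"

def emit_rust_assign(name: str, node: Expr) -> str:
    return f"let {name}: f32 = {emit_expr(node)};"

# One IR tree, N backends: adding a language means one frontend and one backend,
# instead of one translator per language pair.
expr = BinOp("*", BinOp("+", "a", "b"), "c")
print(emit_hlsl_assign("r", expr))   # float r = ((a + b) * c);
print(emit_rust_assign("r", expr))   # let r: f32 = ((a + b) * c);
```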
Sep, 7
AnnotationGym: A Generic Framework for Automatic Source Code Annotation
A common approach to code optimization is to insert compiler hints in the source code using annotations. Two major challenges with using annotations effectively are their complexity and lack of portability. This means, first, that significant developer expertise is required, and, second, that the supported annotations, as well as their syntax and use, can vary […]
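As a toy illustration of automatic annotation insertion (not AnnotationGym's actual interface), the sketch below inserts an OpenMP hint above each `for` loop found in a C source string; a real framework would choose annotations and their parameters far more carefully:

```python
# Toy annotation pass: insert an OpenMP pragma above each "for" loop in a
# C source string. Illustration only, not AnnotationGym's interface.
import re

PRAGMA = "#pragma omp parallel for"

def annotate_loops(c_source: str) -> str:
    out = []
    for line in c_source.splitlines():
        # Match lines whose first token is a for-loop header, keeping indentation.
        match = re.match(r"^(\s*)for\s*\(", line)
        if match:
            out.append(match.group(1) + PRAGMA)
        out.append(line)
    return "\n".join(out)

example = """\
void scale(float *x, int n) {
  for (int i = 0; i < n; i++) {
    x[i] *= 2.0f;
  }
}
"""
print(annotate_loops(example))
```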
Sep, 7
Harnessing Batched BLAS/LAPACK Kernels on GPUs for Parallel Solutions of Block Tridiagonal Systems
We present a GPU implementation for the factorization and solution of block-tridiagonal symmetric positive definite linear systems, which commonly arise in time-dependent estimation and optimal control problems. Our method employs a recursive algorithm based on Schur complement reduction, transforming the system into a hierarchy of smaller, independent blocks that can be efficiently solved in parallel […]
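A dense NumPy sketch of the underlying Schur complement (block-Thomas) recurrence for an SPD block-tridiagonal system is given below; it is the sequential reference algorithm, not the paper's recursive, batched GPU scheme:

```python
# Sequential Schur-complement (block-Thomas) solve of an SPD block-tridiagonal system.
# Reference recurrence only; the paper's contribution is a recursive, batched GPU scheme.
import numpy as np

def block_tridiag_solve(A, B, b):
    """A[i]: SPD diagonal blocks; B[i]: sub-diagonal blocks at position (i+1, i);
    b[i]: right-hand-side blocks. Returns the list of solution blocks x[i]."""
    N = len(A)
    S, y = [None] * N, [None] * N
    S[0], y[0] = A[0], b[0]
    for i in range(1, N):                       # forward elimination
        S[i] = A[i] - B[i - 1] @ np.linalg.solve(S[i - 1], B[i - 1].T)
        y[i] = b[i] - B[i - 1] @ np.linalg.solve(S[i - 1], y[i - 1])
    x = [None] * N
    x[N - 1] = np.linalg.solve(S[N - 1], y[N - 1])
    for i in range(N - 2, -1, -1):              # back substitution
        x[i] = np.linalg.solve(S[i], y[i] - B[i].T @ x[i + 1])
    return x

# Self-check on a small random SPD block-tridiagonal system (3 blocks of size 4).
rng = np.random.default_rng(0)
nb, N = 4, 3
A = []
for _ in range(N):
    M = rng.standard_normal((nb, nb))
    A.append(M @ M.T + nb * np.eye(nb))         # SPD diagonal blocks
B = [0.1 * rng.standard_normal((nb, nb)) for _ in range(N - 1)]
b = [rng.standard_normal(nb) for _ in range(N)]
x = block_tridiag_solve(A, B, b)

K = np.zeros((N * nb, N * nb))                  # assemble the full matrix to verify
for i in range(N):
    K[i*nb:(i+1)*nb, i*nb:(i+1)*nb] = A[i]
for i in range(N - 1):
    K[(i+1)*nb:(i+2)*nb, i*nb:(i+1)*nb] = B[i]
    K[i*nb:(i+1)*nb, (i+1)*nb:(i+2)*nb] = B[i].T
assert np.allclose(K @ np.concatenate(x), np.concatenate(b))
```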
Sep, 7
GPU-acceleration of the Discontinuous Galerkin Shallow Water Equations Solver (DG-SWEM) using CUDA and OpenACC
This paper presents a port of DG-SWEM, a discontinuous Galerkin solver for coastal ocean circulation, and in particular storm surge, to GPUs using two separate approaches: CUDA Fortran and OpenACC. Time-explicit discontinuous Galerkin methods have been shown to exhibit a large amount of data parallelism due to the loose coupling between elements, and thus are […]
Sep, 7
Managing Multi Instance GPUs for High Throughput and Energy Savings
Modern GPUs such as the Ampere series (A30, A100) and the Hopper series (H100, H200) offer both performance and security isolation features. They also support a substantial amount of concurrency, but taking advantage of it can be quite challenging due to the complex constraints on partitioning the chip. In […]
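As a minimal sketch of interrogating MIG partitioning options programmatically, the snippet below lists the available GPU-instance profiles with `nvidia-smi`; it assumes a MIG-capable GPU with MIG mode enabled, and creating instances additionally requires administrative privileges:

```python
# List the available MIG GPU-instance profiles via nvidia-smi.
# Assumes a MIG-capable GPU (e.g. A30/A100/H100/H200) with MIG mode enabled.
import subprocess

def list_gpu_instance_profiles() -> str:
    """Return nvidia-smi's table of GPU-instance profiles (sizes, counts, memory)."""
    result = subprocess.run(
        ["nvidia-smi", "mig", "-lgip"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    print(list_gpu_instance_profiles())
    # A subsequent, privileged step would create instances, e.g.:
    #   nvidia-smi mig -cgi <profile-id> -C
    # subject to the placement constraints the paper discusses.
```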
Aug, 31
Scalable Engine and the Performance of Different LLM Models in a SLURM based HPC architecture
This work elaborates on a high-performance computing (HPC) architecture based on the Simple Linux Utility for Resource Management (SLURM) [1] for deploying heterogeneous Large Language Models (LLMs) into a scalable inference engine. Dynamic resource scheduling and seamless integration of containerized microservices are leveraged to manage CPU, GPU, and memory allocations efficiently in multi-node […]
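A minimal sketch of the kind of SLURM submission such an architecture builds on is shown below; the partition, container image, and model paths are placeholders, and the actual system layers dynamic scheduling and microservices on top of jobs like this:

```python
# Generate and submit a SLURM batch job that launches a containerized LLM
# inference server. Partition, image, and model names are placeholders.
import subprocess

JOB_SCRIPT = """\
#!/bin/bash
#SBATCH --job-name=llm-inference
#SBATCH --partition=gpu            # placeholder partition name
#SBATCH --nodes=1
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=16
#SBATCH --mem=128G
#SBATCH --time=04:00:00

# Placeholder launch command for a containerized inference microservice.
srun apptainer run --nv inference-server.sif --model /models/example-llm --port 8000
"""

def submit(script: str) -> str:
    """Write the job script and submit it with sbatch, returning sbatch's output."""
    with open("llm_inference.sbatch", "w") as f:
        f.write(script)
    result = subprocess.run(
        ["sbatch", "llm_inference.sbatch"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    print(submit(JOB_SCRIPT))   # e.g. "Submitted batch job 123456"
```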
Aug, 31
Dissecting CPU-GPU Unified Physical Memory on AMD MI300A APUs
Discrete GPUs are a cornerstone of HPC and data center systems, requiring the management of separate CPU and GPU memory spaces. Unified Virtual Memory (UVM) has been proposed to ease the burden of memory management, albeit at a high cost in performance. The recent introduction of AMD’s MI300A Accelerated Processing Units (APUs), as deployed in the El […]