Posts
Apr, 13
Scalability Evaluation of HPC Multi-GPU Training for ECG-based LLMs
Training large language models requires extensive computation, made possible by high-performance computing resources. This study compares multi-node and multi-GPU environments for training large language models for electrocardiograms. It provides a detailed mapping of current frameworks for distributed deep learning in multi-node and multi-GPU settings, including Horovod from Uber, DeepSpeed from Microsoft, and the built-in […]
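The data-parallel training that frameworks like Horovod and DeepSpeed implement ultimately rests on a gradient all-reduce across GPUs. As a rough illustration of that primitive (a generic sketch, not code from the study), here is a minimal single-process NCCL all-reduce; the buffer size and the in-place ncclSum reduction are illustrative assumptions.

```cuda
#include <cuda_runtime.h>
#include <nccl.h>
#include <stdlib.h>

int main(void) {
    int nGpus = 0;
    cudaGetDeviceCount(&nGpus);

    // One communicator per local GPU (single-process, multi-GPU setup).
    ncclComm_t* comms = (ncclComm_t*)malloc(nGpus * sizeof(ncclComm_t));
    ncclCommInitAll(comms, nGpus, NULL);  // NULL: use devices 0..nGpus-1

    const size_t count = 1 << 20;  // stand-in size for a gradient buffer
    float** grads = (float**)malloc(nGpus * sizeof(float*));
    cudaStream_t* streams = (cudaStream_t*)malloc(nGpus * sizeof(cudaStream_t));
    for (int i = 0; i < nGpus; ++i) {
        cudaSetDevice(i);
        cudaMalloc(&grads[i], count * sizeof(float));
        cudaMemset(grads[i], 0, count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    // Sum each replica's gradients across all GPUs in place; dividing by
    // the number of workers afterwards yields the averaged gradient.
    ncclGroupStart();
    for (int i = 0; i < nGpus; ++i)
        ncclAllReduce(grads[i], grads[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < nGpus; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
        cudaFree(grads[i]);
        ncclCommDestroy(comms[i]);
    }
    free(streams); free(grads); free(comms);
    return 0;
}
```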
Apr, 13
Large Language Model Powered C-to-CUDA Code Translation: A Novel Auto-Parallelization Framework
CUDA (Compute Unified Device Architecture) parallel programming significantly improves computational efficiency across multiple fields. However, converting serial C code to CUDA poses challenges for non-experts, and traditional tools struggle with complex patterns. While LLMs (Large Language Models) enable automatic parallelization of complex patterns, they may generate CUDA code with synchronization and memory management issues. There […]
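To make the translation task concrete, here is the kind of loop-to-kernel mapping such tools target, sketched by hand rather than taken from the paper's framework: a serial SAXPY loop and a CUDA equivalent, with comments marking the synchronization and memory-management steps where generated code typically goes wrong.

```cuda
#include <cuda_runtime.h>

// Serial C version a translator would start from:
//   for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];

// CUDA equivalent: one thread per loop iteration.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard: the grid may overshoot n
        y[i] = a * x[i] + y[i];
}

int main(void) {
    const int n = 1 << 20;
    float *x, *y;
    // Memory management is one of the error-prone steps: host data must
    // be made visible to the device (unified memory used here).
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();  // omitting this is a classic generated-code bug

    cudaFree(x); cudaFree(y);
    return 0;
}
```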
Apr, 13
GigaAPI for GPU Parallelization
GigaAPI is a user-space API that simplifies multi-GPU programming, bridging the gap between the capabilities of parallel GPU systems and the ability of developers to harness their full potential. The API offers a comprehensive set of functionalities, including fundamental GPU operations, image processing, and complex GPU tasks, abstracting away the intricacies of low-level CUDA and […]
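GigaAPI's own interface is not reproduced here; instead, a sketch of the low-level CUDA pattern such an API has to hide: explicit device selection, per-device buffers and streams, and manual work splitting. The function name and the even splitting scheme are illustrative assumptions.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= f;
}

// The boilerplate a multi-GPU API abstracts away: split the array evenly,
// then copy, launch, and copy back on each device in turn.
void scale_on_all_gpus(float* host, int n, float f) {
    int nGpus = 0;
    cudaGetDeviceCount(&nGpus);
    int chunk = (n + nGpus - 1) / nGpus;

    for (int d = 0; d < nGpus; ++d) {
        int offset = d * chunk;
        int len = (offset + chunk > n) ? n - offset : chunk;
        if (len <= 0) break;

        cudaSetDevice(d);  // every call below targets this device
        float* dev;
        cudaStream_t s;
        cudaStreamCreate(&s);
        cudaMalloc(&dev, len * sizeof(float));
        cudaMemcpyAsync(dev, host + offset, len * sizeof(float),
                        cudaMemcpyHostToDevice, s);
        scale<<<(len + 255) / 256, 256, 0, s>>>(dev, len, f);
        cudaMemcpyAsync(host + offset, dev, len * sizeof(float),
                        cudaMemcpyDeviceToHost, s);
        cudaStreamSynchronize(s);
        cudaFree(dev);
        cudaStreamDestroy(s);
    }
}
```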
Apr, 13
GPU-centric Communication Schemes for HPC and ML Applications
Compute nodes on modern heterogeneous supercomputing systems comprise CPUs, GPUs, and high-speed network interface cards (NICs). Parallelization is a key technique for effectively utilizing these systems to execute scalable simulation and deep learning workloads. The resulting inter-process communication from the distributed execution of these parallel workloads is one of the key factors contributing to its […]
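One simple GPU-centric scheme, shown as a generic sketch rather than anything from the paper, is peer-to-peer copying between devices, which avoids staging data through host memory when the interconnect supports it. The sizes and device IDs are illustrative, and at least two GPUs are assumed.

```cuda
#include <cuda_runtime.h>

int main(void) {
    const size_t bytes = 1 << 26;   // illustrative payload
    float *buf0, *buf1;

    cudaSetDevice(0);
    cudaMalloc(&buf0, bytes);
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (canAccess)
        cudaDeviceEnablePeerAccess(1, 0);  // flags must be 0

    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);

    // Direct device-to-device copy; with peer access enabled this travels
    // over NVLink/PCIe without a bounce through CPU memory.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}
```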
Apr, 13
A Power-Efficient Scheduling Approach in a CPU-GPU Computing System by Thread-Based Parallel Programming
Due to their high computing performance, CPU-GPU heterogeneous computing platforms are widely used in mobile devices such as smartphones, tablet computers, and unmanned aerial vehicles. Because a mobile device is often powered by a battery, designing a power-efficient real-time computing system is an important problem. In this paper, we propose a […]
Mar, 30
Advances in Semantic Patching for HPC-oriented Refactorings with Coccinelle
Currently, the most energy-efficient hardware platforms for floating-point-intensive calculations (also known as High-Performance Computing, or HPC) are graphics processing units (GPUs). However, porting existing scientific codes to GPUs can be far from trivial. This article summarizes our recent advances in enabling machine-assisted, HPC-oriented refactorings with reference to existing APIs and programming idioms available […]
Mar, 30
PyGraph: Robust Compiler Support for CUDA Graphs in PyTorch
CUDA Graphs, a recent feature introduced for NVIDIA GPUs, aim to reduce CPU launch overhead by capturing and launching a series of GPU tasks (kernels) as a DAG. However, deploying CUDA Graphs faces several challenges today due to the static structure of a graph. It also incurs performance overhead due to data […]
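As background, the basic capture-and-replay mechanism underneath: record a stream of kernel launches once, instantiate them as a graph, and replay the whole DAG with one launch call. This is a minimal generic sketch, not PyGraph itself; note how the instantiated graph bakes in the kernel arguments, which is exactly the static structure the paper works around.

```cuda
#include <cuda_runtime.h>

__global__ void step(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main(void) {
    const int n = 1 << 20;
    float* x;
    cudaMalloc(&x, n * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Record a short sequence of kernel launches once...
    cudaGraph_t graph;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    for (int k = 0; k < 10; ++k)
        step<<<(n + 255) / 256, 256, 0, s>>>(x, n);  // x's address is frozen in
    cudaStreamEndCapture(s, &graph);

    // ...instantiate it as an executable DAG...
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, NULL, NULL, 0);  // CUDA 11 signature;
                                                        // CUDA 12 takes (&exec, graph, 0)

    // ...then replay it with a single launch call per iteration,
    // amortizing the CPU-side cost of 10 individual launches.
    for (int iter = 0; iter < 100; ++iter)
        cudaGraphLaunch(exec, s);
    cudaStreamSynchronize(s);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(s);
    cudaFree(x);
    return 0;
}
```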
Mar, 30
Efficient allocation of image recognition and LLM tasks on multi-GPU system
This work evaluates the performance of parallelizing the learning and tuning processes for image classification and large language models. For machine learning models in image recognition, various parallelization methods are developed for different hardware and software scenarios: simple data parallelism, distributed data parallelism, and distributed processing. A detailed description […]
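For flavor, a minimal sketch (not from the paper) of the simplest static allocation policy a study like this might compare against: independent inference-style tasks assigned to GPUs round-robin. The kernel is a stand-in for real model work.

```cuda
#include <cuda_runtime.h>

__global__ void infer(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 0.5f;   // placeholder for real inference
}

// Static round-robin allocation of independent tasks across GPUs.
void run_tasks(float** inputs, float** outputs, int nTasks, int n) {
    int nGpus = 0;
    cudaGetDeviceCount(&nGpus);

    for (int t = 0; t < nTasks; ++t) {
        cudaSetDevice(t % nGpus);     // task t goes to GPU t mod nGpus
        float *dIn, *dOut;
        cudaMalloc(&dIn, n * sizeof(float));
        cudaMalloc(&dOut, n * sizeof(float));
        cudaMemcpy(dIn, inputs[t], n * sizeof(float), cudaMemcpyHostToDevice);
        infer<<<(n + 255) / 256, 256>>>(dIn, dOut, n);
        cudaMemcpy(outputs[t], dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dIn); cudaFree(dOut);
    }
}
```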
Mar, 30
Hardware-Assisted Software Testing and Debugging for Heterogeneous Computing
There is growing interest in the computer architecture community in incorporating heterogeneity and specialization to improve performance. Developers can write heterogeneous applications that consist of host code and kernel code, where compute-intensive kernels can be offloaded from the CPU to a GPU, FPGA, or quantum computer. However, the high complexity of these systems can pose challenges […]
Mar, 30
TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives
Large deep learning models have achieved state-of-the-art performance in a wide range of tasks. These models often necessitate distributed systems for efficient training and inference. The fundamental building blocks for distributed model execution are intra-layer parallel operators. The most effective approach to enhancing the performance of intra-layer parallel operators involves overlapping computation with communication. The […]
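TileLink's tile-centric primitives are not shown here; as background, a sketch of the coarse-grained stream-level version of compute-communication overlap in plain CUDA, with the transfer using pinned host memory so the copy can genuinely run concurrently with the kernel.

```cuda
#include <cuda_runtime.h>

__global__ void compute_tile(float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] += 1.0f;   // stand-in for a tile of real compute
}

int main(void) {
    const int n = 1 << 22;
    float *devA, *devB, *host;
    cudaMalloc(&devA, n * sizeof(float));
    cudaMalloc(&devB, n * sizeof(float));
    cudaMallocHost(&host, n * sizeof(float));  // pinned: copies run async

    cudaStream_t compute, comm;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&comm);

    // Kernel on one stream, data movement on another: the hardware can
    // execute both at once, hiding communication behind computation.
    compute_tile<<<(n + 255) / 256, 256, 0, compute>>>(devA, n);
    cudaMemcpyAsync(devB, host, n * sizeof(float),
                    cudaMemcpyHostToDevice, comm);

    cudaStreamSynchronize(compute);
    cudaStreamSynchronize(comm);

    cudaStreamDestroy(compute);
    cudaStreamDestroy(comm);
    cudaFreeHost(host);
    cudaFree(devA); cudaFree(devB);
    return 0;
}
```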
Mar, 30
Analyzing Modern NVIDIA GPU cores
GPUs are the most popular platform for accelerating HPC workloads, such as artificial intelligence and scientific simulations. However, most microarchitectural research in academia relies on GPU core pipeline designs based on architectures that are more than 15 years old. This paper reverse engineers modern NVIDIA GPU cores, unveiling many key aspects of their design and […]
Mar, 23
The Shamrock code: I. Smoothed Particle Hydrodynamics on GPUs
We present Shamrock, a performance-portable framework developed in C++17 with the SYCL programming standard, tailored for numerical astrophysics on Exascale architectures. The core of Shamrock is an accelerated parallel tree with negligible construction time, whose efficiency is based on binary algebra. The Smoothed Particle Hydrodynamics algorithm of the Phantom code is implemented in Shamrock. […]