Posts
Mar, 4
CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong performance in general programming, large language models (LLMs) remain uncompetitive with compiler-based systems such as this http URL for CUDA kernel generation. Existing CUDA code generation approaches either rely on training-free refinement or fine-tune […]
Mar, 4
StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning
Modern machine learning (ML) workloads increasingly rely on GPUs, yet achieving high end-to-end performance remains challenging due to dependencies on both GPU kernel efficiency and host-side settings. Although LLM-based methods show promise on automated GPU kernel generation, prior works mainly focus on single-kernel optimization and do not extend to end-to-end programs, hindering practical deployment. To […]
Mar, 4
CUDABench: Benchmarking LLMs for Text-to-CUDA Generation
Recent studies have demonstrated the potential of Large Language Models (LLMs) in generating GPU Kernels. Current benchmarks focus on the translation of high-level languages into CUDA, overlooking the more general and challenging task of text-to-CUDA generation. Furthermore, given the hardware-specific and performance-critical features of GPU programming, accurately assessing the performance of LLM-generated GPU programs is […]
Mar, 1
Joint Training on AMD and NVIDIA GPUs
As large language models continue to scale, training demands on compute and system capacity grow rapidly, making single-vendor homogeneous clusters insufficient. This paper presents a technical solution for heterogeneous mixed training in AMD-NVIDIA environments. We first adopt a compatibility-oriented approach based on CPU-Forwarding Communication, with differentiated communication back-end selection across parallel groups and multi-NIC parallel […]
Mar, 1
A Survey of Recent Developments in SYCL Compiler Implementations
This survey discusses recent advancements in SYCL compiler implementations, one of the crucial aspects of compiler construction for heterogeneous computing systems. We explore the transition from traditional compiler construction, from Single-Source Multiple Compiler Passes (SMCP) to a more advanced approach to Single-Source Single Compiler Pass (SSCP). The survey analyzes multiple papers that researched the different […]
Mar, 1
From Prompts to Performance: Evaluating LLMs for Task-based Parallel Code Generation
Large Language Models (LLM) show strong abilities in code generation, but their skill in creating efficient parallel programs is less studied. This paper explores how LLMs generate task-based parallel code from three kinds of input prompts: natural language problem descriptions, sequential reference implementations, and parallel pseudo code. We focus on three programming frameworks: OpenMP Tasking, […]
Mar, 1
CL4SE: A Context Learning Benchmark For Software Engineering Tasks
Context engineering has emerged as a pivotal paradigm for unlocking the potential of Large Language Models (LLMs) in Software Engineering (SE) tasks, enabling performance gains at test time without model fine-tuning. Despite its success, existing research lacks a systematic taxonomy of SE-specific context types and a dedicated benchmark to quantify the heterogeneous effects of different […]
Mar, 1
CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models
Reinforcement Learning from Verifiable Rewards (RLVR) has driven recent progress in code large language models by leveraging execution-based feedback from unit tests, but its scalability is fundamentally constrained by the availability and reliability of high-quality test cases. We propose CodeScaler, an execution-free reward model designed to scale both reinforcement learning training and test-time inference for […]
Feb, 23
HPC++: An LLVM-Based Automatic Parallelization Framework with Heterogeneous CPU–GPU Execution
We present HPC++, an automatic parallelization framework that transforms sequential C++ programs into efficient parallel implementations targeting both multi-core CPUs and OpenCL-capable GPUs. Operating at the LLVM Intermediate Representation (IR) level, HPC++ performs pattern-driven analysis to detect seven distinct parallelization strategies—including reductions, elementwise maps, matrix multiplications, nested loops, search operations, histogram patterns, and independent function […]
Feb, 23
KernelBlaster: Continual Cross-Task CUDA Optimization via Memory-Augmented In-Context Reinforcement Learning
Optimizing CUDA code across multiple generations of GPU architectures is challenging, as achieving peak performance requires an extensive exploration of an increasingly complex, hardware-specific optimization space. Traditional compilers are constrained by fixed heuristics, whereas finetuning Large Language Models (LLMs) can be expensive. However, agentic workflows for CUDA code optimization have limited ability to aggregate knowledge […]
Feb, 23
Fine-Tuning GPT-5 for GPU Kernel Generation
Developing efficient GPU kernels is essential for scaling modern AI systems, yet it remains a complex task due to intricate hardware architectures and the need for specialized optimization expertise. Although Large Language Models (LLMs) demonstrate strong capabilities in general sequential code generation, they face significant challenges in GPU code generation because of the scarcity of […]
Feb, 23
OptiML: An End-to-End Framework for Program Synthesis and CUDA Kernel Optimization
Generating high-performance CUDA kernels remains challenging due to the need to navigate a combinatorial space of low-level transformations under noisy and expensive hardware feedback. Although large language models can synthesize functionally correct CUDA code, achieving competitive performance requires systematic exploration and verification of optimization choices. We present OptiML, an end-to-end framework that maps either natural-language […]

