Posts
Jan, 12
AKG Kernel Agent: A Multi-Agent Framework for Cross-Platform Kernel Synthesis
Modern AI models demand high-performance computation kernels. The growing complexity of LLMs, multimodal architectures, and recommendation systems, combined with techniques like sparsity and quantization, creates significant computational challenges. Moreover, frequent hardware updates and diverse chip architectures further complicate this landscape, requiring tailored kernel implementations for each platform. However, manual optimization cannot keep pace with these […]
Jan, 12
Equivalence Checking of ML GPU Kernels
With the rapid progress of deep learning and large language models (LLMs), companies now spend enormous sums executing GPU kernels. These kernels have, therefore, become prime targets for aggressive optimization. Recent efforts increasingly leverage LLMs to generate GPU kernels, but make no formal guarantees about the generated kernels. We present the first equivalence checker for […]
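The excerpt above is cut off, but to illustrate the gap it points at, here is a minimal sketch of the testing-based alternative: randomized comparison of a reference kernel against an "optimized" one in NumPy. The two softmax functions are hypothetical stand-ins, and unlike the formal checker the post describes, this approach gives no equivalence guarantee.

```python
# Minimal sketch of randomized testing between a reference and an "optimized"
# kernel implementation. This only samples random inputs; it is NOT a proof
# of equivalence, which is the gap a formal equivalence checker closes.
import numpy as np

def reference_softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def optimized_softmax(x):
    # hypothetical stand-in for an LLM-generated or hand-optimized kernel
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
for _ in range(100):
    x = rng.standard_normal((64, 128), dtype=np.float32)
    assert np.allclose(reference_softmax(x), optimized_softmax(x), atol=1e-5)
print("all random tests passed (still not a proof of equivalence)")
```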
Jan, 12
DiffBench Meets DiffAgent: End-to-End LLM-Driven Diffusion Acceleration Code Generation
Diffusion models have achieved remarkable success in image and video generation. However, their inherently multi-step inference process imposes substantial computational overhead, hindering real-world deployment. Accelerating diffusion models is therefore essential, yet determining how to combine multiple model acceleration techniques remains a significant challenge. To address this issue, we introduce a framework driven by large […]
Jan, 12
The New Compiler Stack: A Survey on the Synergy of LLMs and Compilers
This survey has provided a systematic overview of the emerging field of LLM-enabled compilation by addressing several key research questions. We first answered how LLMs are being integrated by proposing a comprehensive, multi-dimensional taxonomy that categorizes works based on their Design Philosophy (Selector, Translator, Generator), LLM Methodology, their operational Level of Code Abstraction, and the […]
Jan, 12
ParaCodex: A Profiling-Guided Autonomous Coding Agent for Reliable Parallel Code Generation and Translation
Parallel programming is central to HPC and AI, but producing code that is correct and fast remains challenging, especially for OpenMP GPU offload, where data movement and tuning dominate. Autonomous coding agents can compile, test, and profile on target hardware, but outputs are brittle without domain scaffolding. We present ParaCodex, an HPC-engineer workflow that turns […]
Jan, 4
Hardware Acceleration for Neural Networks: A Comprehensive Survey
Neural networks have become a dominant computational workload across cloud and edge platforms, but rapid growth in model size and deployment diversity has exposed hardware bottlenecks increasingly dominated by memory movement, communication, and irregular operators rather than peak arithmetic throughput. This survey reviews the technology landscape for hardware acceleration of deep learning, spanning GPUs and […]
Jan, 4
SeedFold: Scaling Biomolecular Structure Prediction
Highly accurate biomolecular structure prediction is a key component of developing biomolecular foundation models, and one of the most critical aspects of building foundation models is identifying the recipes for scaling the model. In this work, we present SeedFold, a folding model that successfully scales up the model capacity. Our contributions are threefold: first, we […]
Jan, 4
KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta
Making deep learning recommendation model (DLRM) training and inference fast and efficient is important. However, this presents three key system challenges: model architecture diversity, kernel primitive diversity, and hardware generation and architecture heterogeneity. This paper presents KernelEvolve, an agentic kernel coding framework, to tackle heterogeneity at scale for DLRM. KernelEvolve is designed to take kernel specifications as […]
Jan, 4
GPU Kernel Optimization Beyond Full Builds: An LLM Framework with Minimal Executable Programs
In high-performance computing, hotspot GPU kernels are primary bottlenecks, and expert manual tuning is costly and hard to port. Large language model methods often assume kernels can be compiled and executed cheaply, which fails in large applications where full builds and runs are expensive. We present an end-to-end LLM framework with performance feedback that optimizes […]
Jan, 4
Generative Video Compression: Towards 0.01% Compression Rate for Video Transmission
Can a video be compressed at an extreme compression rate as low as 0.01%? To this end, we achieve a compression rate of 0.02% in some cases by introducing Generative Video Compression (GVC), a new framework that redefines the limits of video compression by leveraging modern generative video models to achieve extreme compression rates […]
Dec, 29
Memory-Efficient Acceleration of Block Low-Rank Foundation Models on Resource Constrained GPUs
Recent advances in transformer-based foundation models have made them the default choice for many tasks, but their rapidly growing size makes fitting a full model on a single GPU increasingly difficult and their computational cost prohibitive. Block low-rank (BLR) compression techniques address this challenge by learning compact representations of weight matrices. While traditional low-rank (LR) […]
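As a rough illustration of the general idea of low-rank weight compression (not the paper's specific block low-rank scheme or its training procedure), a weight matrix can be replaced by the product of two much smaller factors; a minimal NumPy sketch:

```python
# Minimal sketch of generic low-rank weight compression with NumPy.
# Illustrative only; the post's block low-rank (BLR) method is not reproduced here.
import numpy as np

def low_rank_approx(W: np.ndarray, rank: int):
    """Approximate W (m x n) as A @ B with A (m x rank) and B (rank x n)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # absorb singular values into the left factor
    B = Vt[:rank, :]
    return A, B

W = np.random.randn(1024, 1024).astype(np.float32)
A, B = low_rank_approx(W, rank=128)

# Storage drops from m*n parameters to rank*(m+n):
print(f"compression ratio: {W.size / (A.size + B.size):.1f}x")
```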
Dec, 29
Tilus: A Tile-Level GPGPU Programming Language for Low-Precision Computation
Serving Large Language Models (LLMs) is critical for AI-powered applications, yet it demands substantial computational resources, particularly in memory bandwidth and computational throughput. Low-precision computation has emerged as a key technique to improve efficiency while reducing resource consumption. Existing approaches for generating low-precision kernels are limited to weight bit widths that are powers of two […]
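To make the power-of-two limitation concrete, here is a minimal Python sketch, unrelated to Tilus's actual language or API, of packing sub-byte integer weights: 3-bit values straddle byte boundaries, while 4-bit values pack cleanly two per byte.

```python
# Minimal sketch (not Tilus code) of why non-power-of-two bit widths are awkward:
# 3-bit values do not align to byte boundaries, unlike 4-bit or 8-bit values.
def pack_bits(values, bits):
    """Pack small unsigned integers into bytes, `bits` bits each."""
    buf, acc, nbits = bytearray(), 0, 0
    for v in values:
        acc |= (v & ((1 << bits) - 1)) << nbits
        nbits += bits
        while nbits >= 8:
            buf.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        buf.append(acc & 0xFF)
    return bytes(buf)

weights = [5, 3, 7, 1, 2, 6, 4, 0]       # 3-bit values in [0, 7]
print(len(pack_bits(weights, 3)))        # 8 values * 3 bits = 24 bits -> 3 bytes
print(len(pack_bits(weights, 4)))        # 8 values * 4 bits = 32 bits -> 4 bytes
```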

