Papers on hgpu.org (.txt-file)

__host__ __device__ — Generic programming in Cuda

.NET High Performance Computing

“Local Rank Differences” Image Feature Implemented on GPU

[Serbian] The Methods and Procedures for Accelerating Operations and Queries in Large Database Systems and Data Warehouse (Big Data Systems)

10×10: A General-purpose Architectural Approach to Heterogeneity and Energy Efficiency

190 TFlops Astrophysical N-body Simulation on a Cluster of GPUs

2-D Impulse Noise Suppression by Recursive Gaussian Maximum Likelihood Estimation

24.77 Pflops on a Gravitational Tree-Code to Simulate the Milky Way Galaxy with 18600 GPUs

2D and 3D level-set algorithms on GPU

2D Image Convolution using Three Parallel Programming Models on the Xeon Phi

2D Triangulation of Polygons on CUDA

2D/3D image registration on the GPU

2HOT: An Improved Parallel Hashed Oct-Tree N-Body Algorithm for Cosmological Simulation

2PARMA: Parallel Paradigms and Run-time Management Techniques for Many-Core Architectures

3-SAT on CUDA: Towards a massively parallel SAT solver

3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs

3D data denoising via Non-Local means filter by using parallel GPU strategies

3D Edge Bundling for Geographical Data Visualization

3D finite difference computation on GPUs using CUDA

3D finite element numerical integration on GPUs

3D GPU Architecture using Cache Stacking: Performance, Cost, Power and Thermal analysis

3D Haar-Like Elliptical Features for Object Classification in Microscopy

3D Hydrodynamic Simulation of Classical Nova Explosions

3D Information Extraction Based on GPU

3D Modeling, Distance and Gradient Computation for Motion Planning: A Direct GPGPU Approach

3D Non-Local Means denoising via multi-GPU

3D nonrigid registration via optimal mass transport on the GPU

3D Object Recognition using Convolutional Neural Networks with Transfer Learning between Input Channels

3D Object Recognition with Convolutional Neural Networks

3D Objects Tracking by GPGPU-Enhanced Particle Filter Algorithms

3D Recursive Gaussian IIR on GPU and FPGAs: A Case Study for Accelerating Bandwidth-Bounded Applications

3D Registration Based on Normalized Mutual Information: Performance of CPU vs. GPU Implementation

3D simulation of complex shading affecting PV systems taking benefit from the power of graphics cards developed for the video game industry

3D Skeleton Extraction Method using Potential Field on OpenCL

3D tumor localization through real-time volumetric x-ray imaging for lung cancer radiotherapy

3D vision of electromagnetic fields in antenna and microwave technique

3D visualization of astronomy data cubes using immersive displays

3DES ECB Optimized for Massively Parallel CUDA GPU Architecture

3I: A tool for visualizing and processing in parallel 2D & 3D images

42 TFlops hierarchical N-body simulations on GPUs with applications in both astrophysics and turbulence

4kUHD H264 wireless live video streaming using CUDA

5.6: GPU enhancement of FDTD-PIC plasma-wave simulations

8 Steps to 3.7 TFLOP/s on NVIDIA V100 GPU: Roofline Analysis and Other Tricks

86 PFLOPS Deep Potential Molecular Dynamics simulation of 100 million atoms with ab initio accuracy

94% on CIFAR-10 in 3.29 Seconds on a Single GPU

A (ir)regularity-aware task scheduler for heterogeneous platforms

A (Somewhat Dated) Comparative Study of Betweenness Centrality Algorithms on GPU

A 3D Convex Hull Algorithm for Graphics Hardware

A 3D radiative transfer framework: XIII. OpenCL implementation

A 3D radiative transfer framework. VIII. OpenCL implementation

A 57mW embedded mixed-mode neuro-fuzzy accelerator for intelligent multi-core processor

A 7.663-TOPS 8.2-W Energy-efficient FPGA Accelerator for Binary Convolutional Neural Networks

A balanced programming model for emerging heterogeneous multicore systems

A Batched GPU Algorithm for Set Intersection

A Benchmark Set of Highly-efficient CUDA and OpenCL Kernels and its Dynamic Autotuning with Kernel Tuning Toolkit

A Bi-objective Optimization Framework for Query Plans

A biomolecular electrostatics solver using Python, GPUs and boundary elements that can handle solvent-filled cavities and Stern layers

A block-asynchronous relaxation method for graphics processing units

A Braille Conversion Service Using GPU and Human Interaction by Computer Vision

A breadth-first course in multicore and manycore programming

A capabilities-aware framework for using computational accelerators in data-intensive computing

A Case Against Small Data Types on GPGPUs

A Case for Work-stealing on FPGAs with OpenCL Atomics

A Case Study for Petascale Applications in Astrophysics: Simulating Gamma-Ray Bursts

A Case Study in Using OpenCL on FPGAs: Creating an Open-Source Accelerator of the AutoDock Molecular Docking Software

A Case Study of OpenCL on an Android Mobile GPU

A Case Study of SWIM: Optimization of Memory Intensive Application on GPGPU

A case study on porting scientific applications to GPU/CUDA

A Case Study: Exploiting Neural Machine Translation to Translate CUDA to OpenCL

A CG-based Poisson solver on a GPU-cluster

A characterization and analysis of PTX kernels

A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads

A Chunking Method for Euclidean Distance Matrix Calculation on Large Dataset Using Multi-GPU

A class of communication-avoiding algorithms for solving general dense linear systems on CPU/GPU parallel machines

A Class of Hybrid LAPACK Algorithms for Multicore and GPU Architectures

A Cloud Computing Service Architecture of a Parallel Algorithm Oriented to Scientific Computing with CUDA and Monte Carlo

A cluster for CS education in the manycore era

A Co-Design Framework with OpenCL Support for Low-Energy Wide SIMD Processor

A Co-Prime Blur Scheme for Data Security in Video Surveillance

A Coarse Grain Reconfigurable Architecture for sequence alignment problems in bio-informatics

A code motion technique for accelerating general-purpose computation on the GPU

A Code Optimization Framework for Performance Portability of GPU Kernels onto Custom Accelerators

A Code Transformation Framework for Scientific Applications on Structured Grids

A code-based analytical approach for using separate device coprocessors in computing systems

A Collective Knowledge workflow for collaborative research into multi-objective autotuning and machine learning techniques

A collision detection algorithm using adaptive particle sensor

A combined MPI-CUDA parallel solution of linear and nonlinear Poisson-Boltzmann equation

A Common GPU n-Dimensional Array for Python and C

A Comparative Analysis of GPU Implementations of Spectral Unmixing Algorithms

A comparative analysis of the performance and deployment overhead of parallelized Finite Difference Time Domain (FDTD) algorithms on a selection of high performance multiprocessor computing systems

A comparative benchmarking of the FFT on Fermi and Evergreen GPUs

A Comparative Measurement Study of Deep Learning as a Service Framework

A Comparative Study of 2D Numerical Methods with GPU Computing

A Comparative Study of Asynchronous Many-Tasking Runtimes: Cilk, Charm++, ParalleX and AM++

A Comparative Study of Game Tree Searching Methods

A comparative study of GPU programming models and architectures using neural networks

Titles: 100
open PDFs: 88
packages: 9
