Graphics Processing Units (GPUs) support dynamic voltage and frequency scaling (DVFS) to balance computational performance and energy consumption. However, there is still no simple and accurate way to estimate the performance of a given GPU kernel under different frequency settings on real hardware, which is important for choosing the best frequency configuration for energy saving. This paper presents a fine-grained model that estimates the execution time of GPU kernels under both core and memory frequency scaling. Across 12 GPU kernels and a 2.5x range of both core and memory frequencies, our model is accurate to within 3.5% on real hardware. Unlike cycle-level simulators, our model needs only a few simple micro-benchmarks to extract a set of hardware parameters, plus the performance counters of the kernels, to achieve this accuracy.
The availability of OpenCL for FPGAs has raised new questions about the efficiency of massive thread-level parallelism on FPGAs. The general trend is toward deep pipelining and in-order execution of many OpenCL threads across a shared data-path. While this can be very effective for regular kernels, its efficiency diminishes significantly for irregular kernels with runtime-dependent control flow. New approaches are needed to improve the execution efficiency of FPGAs when targeting irregular OpenCL kernels. This paper proposes a novel solution, called Hardware Thread Reordering (HTR), to boost the throughput of FPGAs when executing irregular kernels with non-deterministic runtime control flow. The key insight of HTR is out-of-order OpenCL thread execution over a shared data-path to achieve significantly higher throughput. The thread reordering is performed at basic-block granularity. The synthesized basic blocks are extended with independent pipeline control signals and context registers to bypass the live values of reordered threads. We demonstrate the efficiency of our proposed solution on three parallel irregular kernels. For the experiments, we use the LegUp tool to compare the baseline (in-order) data-path with the HTR-enhanced data-path. Our RTL simulation results demonstrate that the HTR-enhanced data-path achieves up to an 11X increase in kernel throughput at a low overhead (less than a 2X increase in FPGA resources).
As process scaling trends make the memory system an even more crucial bottleneck, latency-hiding techniques such as prefetching grow further in importance. However, naive use of prefetching can harm performance and energy efficiency, so several factors and parameters need to be taken into account to fully realize its potential. In this paper, we survey several recent techniques that aim to improve the implementation and effectiveness of prefetching. We characterize the techniques on several parameters to highlight their similarities and differences. The aim of this survey is to give researchers insight into how prefetching techniques work and to spark future work that improves the performance advantages of prefetching even further.
Specifications
- GPU: Curacao XT
- Tech process: 28 nm
- GPU architecture: GCN 1.0
- GCN Stream processors: 1280
- Performance (single precision): 2688 GFlops
- Performance (double precision): 168 GFlops
- Max resolution per display: 4096 x 2160
- Display outputs: 3 DVI/HDMI, 4 DisplayPorts
- Core Clock: 1000 MHz
- Memory Clock: 1400 MHz
- Effective Memory Clock: 5600 MHz
- Memory Type: GDDR5
- Amount of memory: 2048/4096 MB
- Memory Bandwidth: 179.2 GB/sec
- Bus width: 256 bit
- Interface: PCI-Express 3.0 x16
- OpenCL/OpenGL version: 1.2/4.4
- DirectX compliance: 12
- Maximum TDP: 180 W
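The memory figures in the list above are mutually consistent: the peak bandwidth follows from the effective memory clock and the bus width. The following is a minimal sketch of that arithmetic, assuming the usual convention that bandwidth equals effective clock times bus width in bytes, with 1 GB taken as 10^9 bytes. The function name `memory_bandwidth_gb_s` is illustrative and does not come from the source.

```python
def memory_bandwidth_gb_s(effective_clock_mhz: float, bus_width_bits: int) -> float:
    """Theoretical peak memory bandwidth in GB/sec (assumes 1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8                    # bits -> bytes per transfer
    mb_per_second = effective_clock_mhz * bytes_per_transfer   # MHz * bytes = MB/sec
    return mb_per_second / 1000                                # MB/sec -> GB/sec


# Values taken from the list above: 5600 MHz effective clock, 256-bit bus.
bandwidth = memory_bandwidth_gb_s(5600, 256)
print(f"{bandwidth:.1f} GB/sec")  # prints "179.2 GB/sec"
```

This is a theoretical peak only; sustained bandwidth in real workloads is lower.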
Retail Cards Based On This Board
2 GB
- ASUS DirectCU II R9270X-DC2T-2GD5 Radeon R9 270X 2GB
- GIGABYTE GV-R927XOC-2GD Radeon R9 270X 2GB
- PowerColor PCS+ AXR9 270X 2GBD5-PPDHE Radeon R9 270X 2GB
- MSI R9 270X GAMING 2G Radeon R9 270X 2GB
- XFX R9-270X-CDBC Radeon R9 270X 2GB
4 GB
Based on Newegg.com and manufacturers' data
Specifications
- GPU: Hawaii XT
- Tech process: 28 nm
- GCN Stream processors: 2816
- Performance (single precision): 5070 GFlops
- Performance (double precision): 2530 GFlops
- Display outputs: No
- Core Clock: 900 MHz
- Memory Clock: 1375 MHz
- Effective Memory Clock: 5500 MHz
- Memory Type: GDDR5
- ECC: External
- Amount of memory: 16384 MB
- Memory Bandwidth: 320 GB/sec
- Bus width: 512 bit
- Interface: PCIe 3.0 x16
- OpenCL/OpenGL version: 2.0/4.4
- DirectX compliance: 12
- Shader model: 5
- Maximum TDP: 225 W
- Warranty: 3 years
Retail Cards Based On This Board
16 GB
Based on amd.com data
Specifications
- GPU: Hawaii XT
- Tech process: 28 nm
- GCN Stream processors: 2560
- Performance (single precision): 4220 GFlops
- Performance (double precision): 2110 GFlops
- Display outputs: No
- Memory Type: GDDR5
- ECC: External
- Amount of memory: 12288 MB
- Memory Bandwidth: 320 GB/sec
- Bus width: 512 bit
- Interface: PCIe 3.0 x16
- OpenCL/OpenGL version: 2.0/4.4
- DirectX compliance: 12
- Shader model: 5
- Maximum TDP: 225 W
- Warranty: 3 years
Retail Cards Based On This Board
12 GB
Based on amd.com data
Specifications
- GPU: Tahiti PRO
- Tech process: 28 nm
- GCN Stream processors: 1792
- Performance (single precision): 3225.6 GFlops
- Performance (double precision): 806.4 GFlops
- Display outputs: 1 DisplayPort
- Core Clock: 900 MHz
- Memory Clock: 1375 MHz
- Effective Memory Clock: 5500 MHz
- Memory Type: GDDR5
- ECC: Internal/External
- Amount of memory: 12288 MB
- Memory Bandwidth: 264 GB/sec
- Bus width: 384 bit
- Interface: PCIe 3.0 x16
- OpenCL/OpenGL version: 1.2/4.4
- DirectX compliance: 12
- Shader model: 5
- Maximum TDP: 225 W
- Warranty: 3 years
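The single-precision figure in the list above matches the common peak-throughput estimate of two floating-point operations (one fused multiply-add) per stream processor per clock. Below is a short sketch of that estimate under this assumption. The names `peak_gflops` and `ops_per_clock` are illustrative and not from the source.

```python
def peak_gflops(stream_processors: int, core_clock_mhz: float, ops_per_clock: int = 2) -> float:
    """Theoretical peak throughput in GFlops.

    ops_per_clock=2 assumes one fused multiply-add per stream processor per clock.
    """
    # processors * ops/clock * MHz gives MFlops; divide by 1000 for GFlops.
    return stream_processors * ops_per_clock * core_clock_mhz / 1000


# Values taken from the list above: 1792 stream processors at 900 MHz.
single = peak_gflops(1792, 900)
print(f"{single:.1f} GFlops single precision")   # prints "3225.6 GFlops single precision"

# Ratio of the listed double-precision figure to the computed single-precision peak.
print(f"double/single ratio: {806.4 / single:.2f}")  # prints "double/single ratio: 0.25"
```

Treat this as a sanity check on the listed numbers rather than a measurement; a published figure may be quoted at a different clock than the one listed.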
Retail Cards Based On This Board
12 GB
Based on amd.com data
Specifications
- GPU: Tahiti PRO
- Tech process: 28 nm
- GCN Stream processors: 1792
- Performance (single precision): 3225.6 GFlops
- Performance (double precision): 806.4 GFlops
- Display outputs: 1 DisplayPort
- Core Clock: 900 MHz
- Memory Clock: 1375 MHz
- Effective Memory Clock: 5500 MHz
- Memory Type: GDDR5
- ECC: Internal/External
- Amount of memory: 6144 MB
- Memory Bandwidth: 264 GB/sec
- Bus width: 384 bit
- Interface: PCIe 3.0 x16
- OpenCL/OpenGL version: 1.2/4.4
- DirectX compliance: 12
- Shader model: 5
- Maximum TDP: 225 W
- Warranty: 3 years
Retail Cards Based On This Board
6 GB
Based on amd.com data
Specifications
- GPU: Pitcairn XT
- Tech process: 28 nm
- GCN Stream processors: 1280
- Performance (single precision): 2432 GFlops
- Performance (double precision): 152 GFlops
- Display outputs: 1 DisplayPort
- Core Clock: 950 MHz
- Memory Clock: 1200 MHz
- Effective Memory Clock: 4800 MHz
- Memory Type: GDDR5
- ECC: No
- Amount of memory: 4096 MB
- Memory Bandwidth: 153.6 GB/sec
- Bus width: 256 bit
- Interface: PCIe 3.0 x16
- OpenCL/OpenGL version: 1.2/4.4
- DirectX compliance: 12
- Shader model: 5
- Maximum TDP: 150 W
- Warranty: 3 years
Retail Cards Based On This Board
4 GB
Based on amd.com data
Specifications
- GPU: Pitcairn LE
- Tech process: 28 nm
- GCN Stream processors: 768
- Performance (single precision): 1267.2 GFlops
- Performance (double precision): 79.2 GFlops
- Max resolution per display: 2560 x 1600
- Display outputs: 2 Mini DisplayPort
- Core Clock: 825 MHz
- Memory Clock: 800 MHz
- Effective Memory Clock: 3200 MHz
- Memory Type: GDDR5
- ECC: No
- Amount of memory: 2048 MB
- Memory Bandwidth: 102.4 GB/sec
- Bus width: 256 bit
- Interface: PCIe 3.0 x16
- OpenCL/OpenGL version: 1.2/4.4
- DirectX compliance: 12
- Shader model: 5
- Maximum TDP: 150 W
- Warranty: 3 years
Retail Cards Based On This Board
2 GB
Based on amd.com data
Specifications
- GPU: Tahiti PRO
- Tech process: 28 nm
- GCN Stream processors: 2 x 1792
- Performance (single precision): 5913.6 GFlops
- Performance (double precision): 1478.4 GFlops
- Max resolution per display: 4096 x 2160
- Display outputs: 1 Mini DisplayPort
- Core Clock: 825 MHz
- Memory Clock: 1250 MHz
- Effective Memory Clock: 5000 MHz
- Memory Type: GDDR5
- ECC: Yes
- Amount of memory: 2 x 3072 / 2 x 6144 MB
- Memory Bandwidth: 2 x 240 GB/sec
- Bus width: 384 bit
- Interface: PCIe 3.0 x16
- OpenCL/OpenGL version: 1.2/4.4
- DirectX compliance: 12
- Shader model: 5
- Maximum TDP: 375 W
- Warranty: 3 years
Retail Cards Based On This Board
6 GB
12 GB
Based on amd.com data
Specifications
- GPU: Turks
- Tech process: 40 nm
- Stream processors: 400
- Performance (single precision): 624 GFlops
- Performance (double precision): –
- Max resolution per display: 2560 x 1600
- Display outputs: 1 DisplayPort, 1 DVI-I
- Core Clock: 650 MHz
- Memory Clock: 900 MHz
- Effective Memory Clock: 1800 MHz
- Memory Type: GDDR3
- ECC: No
- Amount of memory: 1024 MB
- Memory Bandwidth: 28.8 GB/sec
- Bus width: 128 bit
- Interface: PCIe 2.1 x16
- OpenGL version: 4.4
- DirectX compliance: 11
- Shader model: 5
- Maximum TDP: 50 W
- Warranty: 3 years
Retail Cards Based On This Board
1 GB
Based on amd.com data