John Wikman

Automatic GPU optimization through higher-order functions in functional languages

Abstract

In recent years, GPUs have become popular devices for running procedures that exhibit data parallelism. Because of their high parallel capability, running such procedures on a GPU can yield speedups ranging from a few times to several orders of magnitude compared to serial execution on a CPU. Interfaces such as CUDA and OpenCL flexibly expose the parallel capabilities of the GPU to the programmer, but they also place considerable responsibility on the programmer to handle aspects such as thread synchronization and memory management.

A different approach to GPU optimization is to enable it through higher-order functions with known data parallelism, using the semantics of the higher-order function to determine the parallel execution. In practice, this approach has been made available either through libraries for existing languages or by integrating it directly into the languages themselves.
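As a minimal sketch of why the semantics of a higher-order function expose parallelism, consider `map`: each output element depends only on the corresponding input element, so the input can be split into chunks that are processed independently. The Python below is illustrative only and is not the implementation from this thesis; `chunked_map` and `num_chunks` are hypothetical names, and a thread pool stands in for GPU threads.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked_map(f, xs, num_chunks=4):
    """Apply f to every element of xs, processing one chunk per worker.

    Assumption: f is pure, so chunks may be evaluated in any order.
    """
    xs = list(xs)
    if not xs:
        return []
    size = -(-len(xs) // num_chunks)  # ceiling division
    chunks = [xs[i:i + size] for i in range(0, len(xs), size)]
    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        # pool.map preserves chunk order, so results line up with the input.
        parts = pool.map(lambda chunk: [f(x) for x in chunk], chunks)
        return [y for part in parts for y in part]
```

The caller only writes `chunked_map(f, xs)`; how the work is divided is decided entirely by the function, which is the property that lets a compiler or library map it onto a GPU.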

However, higher-order functions by themselves do not address the question of when it is beneficial to execute on a GPU. Because the GPU is a separate device, effects such as latency and memory transfer can cause a slowdown for small inputs.

In this thesis, a set of commonly used higher-order functions is GPU-enabled as compiler intrinsics in a small functional language. These higher-order functions are also equipped with the option of automatically deciding at runtime whether to execute on the GPU or on the CPU.
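One way to picture such a runtime decision is a simple cost comparison, sketched below in Python. This is an assumed linear cost model for illustration, not necessarily the decision method used in this thesis; `choose_device`, `auto_map`, and all cost parameters are hypothetical.

```python
def choose_device(n, cpu_cost_per_elem, gpu_cost_per_elem, gpu_overhead):
    """Pick the device with the lower estimated running time.

    Assumed model: CPU time grows linearly with n, while GPU time is a
    fixed overhead (launch latency, memory transfer) plus a smaller
    per-element cost.
    """
    cpu_time = n * cpu_cost_per_elem
    gpu_time = gpu_overhead + n * gpu_cost_per_elem
    return "gpu" if gpu_time < cpu_time else "cpu"


def auto_map(f, xs, run_cpu, run_gpu, **costs):
    """Dispatch a map to run_cpu or run_gpu based on the input size."""
    xs = list(xs)
    device = choose_device(len(xs), **costs)
    runner = run_gpu if device == "gpu" else run_cpu
    return runner(f, xs)
```

Under this model, the fixed overhead dominates for small inputs and the CPU is chosen; past the break-even size, the lower per-element cost makes the GPU the faster option.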

Results show that running higher-order functions on GPU yields a speedup for larger computations. However, the performance does not match existing solutions that provide additional higher-order functions for optimizing the parallelization.

The selected approach for automatically deciding whether to run a higher-order function on the GPU or on the CPU chooses the faster option in a majority of cases. The most notable benefit of automatic decisions, however, was for procedures that use multiple higher-order function invocations, which ran faster than when executing only on the GPU or only on the CPU.