At the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2022), researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) published a paper titled ‘Exocompilation for productive programming of hardware accelerators’, which proposes Exo, a new programming language for writing high-performance code on hardware accelerators.
Exo is a domain-specific programming language that helps low-level performance engineers transform very simple programs that specify what they want to compute into much faster, far more complex programs that do the same thing as the specification. It is both a programming language and a compiler, and it allows custom hardware instructions, specialised memories and accelerator configuration state to be defined in user libraries.
Exo builds on the idea of user scheduling to externalise hardware mapping and optimisation decisions.
The nudging factor
Accelerators like GPUs and image signal processors play an increasingly important role in modern computer systems. Even in CPUs, performance gains increasingly come from new instructions implemented by specialised functional units. Such specialised hardware is more efficient than software running on general-purpose hardware. However, most applications can only achieve this performance and efficiency to the extent that key low-level libraries of high-performance kernels (such as BLAS, cuDNN, MKL and others) are optimised to exploit the hardware. Thus, the role played by high-performance kernel libraries becomes critical.
However, the performance engineers who create these high-performance, low-level libraries have limited programming-language support. Despite decades of work on fully automatic compiler optimisation, state-of-the-art kernels, such as those for signal processing, cryptography, deep learning and linear algebra, are still primarily written by hand, directly in low-level C with hardware-specific intrinsics or assembly, or with light metaprogramming (for example, macros or C++ templates) over such low-level code. As a result, developing and optimising these libraries requires a tremendous amount of work, which restricts the range of accelerated routines (i.e., sequences of code intended to be called repeatedly during the execution of a program) and makes it difficult to deploy new or improved accelerators.
With Exo, performance engineers no longer need to hand-write and debug complex, optimised kernel code; they can focus on improving performance instead.
Exo works on the principle of exocompilation. Exocompilation is a new approach to programming language and compiler support for developing hardware-accelerated high-performance libraries. Exocompilation externalises as much accelerator-specific code-generation logic and optimisation policy from the compiler as possible to high-performance library writers at the user level.
Exocompilation permits the performance engineer, rather than the compiler, to control which optimisations to use, when to use them and in what order. This lets engineers avoid unwanted optimisations that compilers would otherwise apply automatically, while Exo checks that each optimisation is correct.
“Traditionally, a lot of research has focused on automating the optimisation process for the specific hardware. This is great for most programmers, but for performance engineers, the compiler gets in the way as often as it helps. Because the compiler’s optimisations are automatic, there’s no good way to fix it when it does the wrong thing”, said Yuka Ikarashi, a PhD student at MIT CSAIL and lead author of the paper.
Another key aspect of exocompilation is that it removes the need for dedicated compiler developers. Until now, compiler developers have been responsible for maintaining the definition of the hardware interface. Because the hardware interface of most accelerator chips is proprietary, companies have had to maintain their own copy of a whole traditional compiler, modified to support their particular chips. With exocompilation, performance engineers can describe the new chips they want to optimise without having to modify the compiler.
Decoding the Exo system
The Exo system consists of an imperative programming language, a means of defining hardware targets via user libraries, and a rewrite-based scheduling system.
Defining hardware in libraries has its own advantages: hardware vendors do not need to maintain compiler forks to protect their hardware’s proprietary details, and the cost of adding support for new hardware is significantly reduced.
Rewrite-based scheduling enables Exo users to transform a simple program into an equivalent but more complex, higher-performance version targeted at a specific hardware accelerator by successively rewriting the application.
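The flavour of such a rewrite can be sketched in plain Python (this is an analogy, not Exo code; the function names and the loop-tiling rewrite are invented for illustration): a simple “specification” loop, and a scheduled rewrite of it that restructures the loop nest while computing exactly the same result.

```python
def scaled_sum(x, alpha):
    # Simple specification: the sum of alpha * x[i] over all i.
    total = 0.0
    for i in range(len(x)):
        total += alpha * x[i]
    return total


def scaled_sum_tiled(x, alpha, tile=4):
    # Equivalent "scheduled" version: the single loop is split into an
    # outer loop over tiles and an inner loop within each tile, mirroring
    # the kind of loop-splitting rewrite a scheduling system applies to
    # match a cache or accelerator's preferred access pattern.
    total = 0.0
    n = len(x)
    for lo in range(0, n, tile):
        for i in range(lo, min(lo + tile, n)):
            total += alpha * x[i]
    return total
```

The point of rewrite-based scheduling is that each such step is a small, checkable transformation, so the complex final program provably computes the same thing as the simple starting one.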
Image source: ‘Exocompilation for productive programming of hardware accelerators’, Yuka Ikarashi (June, 2022), MIT, CSAIL, USA
Three key features of the Exo language are memories, instructions and configuration state. Using these features, an Exo programmer can hand-write code for a given accelerator, or use scheduling to rewrite a simple program to target it.
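Loosely following the paper’s description, a user library might expose a custom hardware instruction as an annotated procedure whose body spells out its semantics. The pseudocode below is a hypothetical sketch only: the decorator, the `@ ACCEL_MEM` memory annotation and all names are illustrative assumptions, not verified Exo syntax.

```
# Hypothetical sketch, in the spirit of Exo's user-level hardware libraries.
@instr("ACC_VEC_ADD({dst}, {a}, {b});")   # low-level code emitted for this op
def vec_add(dst: f32[8] @ ACCEL_MEM,      # operands live in a user-defined
            a:   f32[8] @ ACCEL_MEM,      # specialised memory
            b:   f32[8] @ ACCEL_MEM):
    # The body defines what the instruction means, so the system can check
    # that substituting the instruction for this loop is a correct rewrite.
    for i in range(8):
        dst[i] = a[i] + b[i]
```

Because the instruction’s semantics are stated in the library rather than hard-coded in the compiler, the same scheduling machinery can check rewrites against any accelerator a vendor describes this way.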
How does Exo fare?
The researchers demonstrated how Exo enabled faster co-design of Gemmini’s hardware-software interface. Gemmini is an open-source machine learning accelerator. Exo made it easier for programmers to change hardware targets, which is common when developing new accelerators. The case studies further showed that Exo could be used to achieve performance competitive with state-of-the-art, highly hand-tuned libraries on x86.
“We’ve shown that we can use Exo to quickly write code that’s as performant as Intel’s hand-optimised Math Kernel Library”, says Gilbert Bernstein, a postdoctoral researcher at UC Berkeley.
Exo is currently best suited to programmers and performance engineers who are optimising numerical programs or developing their own accelerator hardware. The researchers plan to enable automatic generation of runner programs, which would make benchmarking easier, and to add support for data-dependent accesses such as histograms. They also envisage a more productive scheduling meta-language, and intend to expand Exo’s semantics to support parallel programming models so that it can target even more accelerators, including GPUs.