The new and varied architectures of modern supercomputers make it very hard to write high-quality scientific codes without either sacrificing sophisticated mathematics or paying a high price. For instance, one needs to face the following walls:
Writing parallel codes in low-level languages, then tuning and optimizing them, is time-consuming.
- Memory latency is decreasing very slowly
- Concurrency: the number of cores per memory is increasing
- Technological advances on memory are very slow
Since we usually have access to different supercomputers, with different architectures, technologies and compilers, writing portable codes is mandatory.
- The dissipated power grows roughly as (frequency of the processor)^3
- Dissipated power per unit area is limited by the cooling system
- Power is very expensive, and its cost is critical at Exascale, for instance
Dealing with all these walls is so difficult that one cannot expect an end user (mathematician, physicist or engineer) to handle them. A clear and clean separation between low-level requirements and high-level production is therefore mandatory.
A solution that is (re)gaining considerable traction is based on Domain Specific Languages (DSLs). Among the well-known examples, we may mention FFTW, whose code generator (genfft) is a powerful optimizing compiler for the fast Fourier transform, and ExaStencils, which uses Scala as the host language and is heavily based on C++ meta-programming for the low-level part. ExaStencils is built on several levels of domain-specific languages, with ad hoc optimizations provided at each level. Other interesting frameworks include Liszt and Terra.
In a nutshell, a DSL is a specialized computer language designed for a specific task. DSLs are popular for two reasons (M. Fowler): they improve the productivity of developers and the communication with domain experts.
One drawback of developing a DSL is the need to provide users with enough tooling, such as debuggers, to help them write their codes. In order to avoid developing additional tools, I decided to use Python as a DSL. This means that the user writes Python code, uses it to validate their algorithms, and then converts it to low-level code (Fortran), that is, accelerates it. This is done with the Pyccel software, which is a static compiler for Python.
The Pyccel workflow is described in the following diagram:
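As an illustration of the kind of code this workflow targets, here is a minimal sketch of a numerical kernel written in plain, type-annotated Python. The function name `integrate_sin` and the choice of the trapezoidal rule are assumptions made for this example only; they do not come from the Pyccel code base. The point is that the same file runs unchanged under the standard Python interpreter, so it can be debugged and validated with ordinary Python tools before being translated.

```python
from math import sin, pi


def integrate_sin(a: float, b: float, n: int) -> float:
    """Composite trapezoidal rule for sin(x) on [a, b] with n sub-intervals."""
    h = (b - a) / n
    # End points carry weight 1/2, interior points weight 1.
    total = 0.5 * (sin(a) + sin(b))
    for i in range(1, n):
        total += sin(a + i * h)
    return total * h


if __name__ == "__main__":
    # Validation step in pure Python: the exact integral of sin on [0, pi] is 2.
    print(integrate_sin(0.0, pi, 1000))
```

Once validated, such a file would typically be handed to the Pyccel command-line tool to generate and compile the Fortran version; the exact invocation depends on the installed Pyccel version, so it is not reproduced here.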
In addition to being able to accelerate Python codes, Pyccel allows the user to
- use MPI through mpi4py
- use BLAS and LAPACK through SciPy
- use FFT
- use OpenMP and OpenACC through pragmas
- study the arithmetic and memory complexity of a given Python code
The last point is possible because Pyccel is an extension of the well-established Python library SymPy. In essence, Pyccel converts Python code into a symbolic expression tree and then processes it.
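To give a feel for what processing an expression tree can yield, the sketch below counts the arithmetic operators that appear in a small kernel. It deliberately uses Python's standard `ast` module rather than SymPy or Pyccel, so it is not Pyccel's implementation; the names `SOURCE`, `count_ops` and `axpy_dot` are hypothetical and chosen for this example.

```python
import ast

# Hypothetical kernel, kept as a string so that it can be parsed into a tree.
SOURCE = """
def axpy_dot(n, alpha, x, y):
    s = 0.0
    for i in range(n):
        y[i] = alpha * x[i] + y[i]
        s = s + y[i] * y[i]
    return s
"""

# Map AST operator node types to short operation names.
_OP_NAMES = {ast.Add: "ADD", ast.Sub: "SUB", ast.Mult: "MUL", ast.Div: "DIV"}


def count_ops(source: str) -> dict:
    """Count binary arithmetic operators as written in the source.

    This is a static count: an operator inside a loop body is counted once,
    not once per iteration.
    """
    counts = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.BinOp):
            name = _OP_NAMES.get(type(node.op))
            if name is not None:
                counts[name] = counts.get(name, 0) + 1
    return counts


ops = count_ops(SOURCE)
```

Here the loop body contains two additions and two multiplications. A symbolic tool can go one step further and multiply these counts by the loop trip count, which would give 2n additions and 2n multiplications for this kernel.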
The following benchmarks give an idea of how well Pyccel performs compared to other Python accelerators:
As we can see, Pyccel gives very competitive results and outperforms most of the existing tools. In addition, the user can select the compiler of their choice.
Pyccel is the foundation of what I am developing in HPC. It allows the user to focus on algorithms rather than technical details, thanks to the power of a high-level language such as Python.
The following example is a 4D Vlasov-Poisson code written entirely in pure Python, then accelerated using Pyccel (32 procs/node).
As we can see, we get very good scalability up to 4096 processors. This work was done during the Master's thesis of E. Bourne, and we had to make some manual modifications to the generated Fortran code. These modifications have since been introduced in Pyccel through two decorators: pure (for pure functions) and stack_array (for arrays allocated on the stack).
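The two decorators can be sketched as follows. This example assumes that `pyccel.decorators` exposes `pure` and `stack_array`, and that `stack_array` takes the names of the local arrays to place on the stack; both points should be checked against the installed Pyccel version. When Pyccel is not available, the sketch falls back to no-op decorators so that the file still runs as ordinary Python. The function `weighted_average` is hypothetical.

```python
import numpy as np

try:
    # Assumed import path; verify against the installed Pyccel version.
    from pyccel.decorators import pure, stack_array
except ImportError:
    # Fallback no-op decorators so the example runs without Pyccel.
    def pure(func):
        return func

    def stack_array(*names):
        def decorator(func):
            return func
        return decorator


@pure
@stack_array('weights')
def weighted_average(a: float, b: float, c: float) -> float:
    # Small, fixed-size work array: a natural candidate for stack allocation.
    weights = np.zeros(3)
    weights[0] = 0.25
    weights[1] = 0.5
    weights[2] = 0.25
    return weights[0] * a + weights[1] * b + weights[2] * c


result = weighted_average(1.0, 2.0, 3.0)
```

In pure Python the decorators change nothing; they only carry hints for the translation step, namely that the function has no side effects and that the named array need not be allocated on the heap.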