This PhD subject is intended for students holding an M2 (or engineering) degree in computer science or high-performance computing.
Advisor: Thomas Padioleau email@example.com (+33 (0)1 69 08 10 37)
Director: Julien Bigot firstname.lastname@example.org (+33 (0)1 69 08 01 75)
Keywords: parallel programming models, software engineering, modern C++, partitioned global address space, performance portability, GPU, extreme-scale parallelism
Extreme-scale simulation codes are typically developed in C or Fortran and parallelized with MPI and OpenMP. This approach has proven very successful for code portability over the last decades, but with the advent of GPU-based supercomputers, codes now have to be ported to new architectures for which a rewrite from scratch is often required.
Much effort has recently been devoted to designing programming models targeting (performance) portability between CPUs and GPUs, including Kokkos, OpenACC, RAJA, and SYCL. Some also target distributed-memory parallelism, such as Co-array Fortran, HPX, UPC, UPC++, or XMP. These models rely on the same abstractions as previous models: (multi-dimensional) arrays and (parallel) loops. Numerical applications then build higher-level abstractions on top of these, taking into account the specifics of the hardware they target, which limits performance portability. Many concepts also remain implicit in the code so as not to pay a cost at execution, leading to issues with code maintainability and adaptability.
These issues can only be solved by directly providing higher-level abstractions to numerical simulation codes. An interesting example is offered by numpy and dask.Array in Python. These libraries make it possible to express numerical computations in a very natural way, thanks to the mesh and data-distribution abstractions they provide, and to execute them on a variety of hardware architectures by adding just a few dedicated annotations. However, this ecosystem remains far from offering performance comparable with Fortran + OpenMP/MPI.
The goal of this PhD thesis is to evaluate whether a solution based on C++ template metaprogramming can offer high-level (zero-cost) abstractions handling a large range of data discretizations at compile time. The work will take place in the framework of the ddc library and will be evaluated on the very demanding simulation code GYSELA, which leverages the largest existing supercomputers and manipulates multiple complex discretizations of its high-dimensional data during execution. The approach will have to handle seamless replacement of discretizations in the code (e.g. from a structured uniform mesh to an unstructured one), while offering the best performance from each. It will also have to handle parallelism at all levels: distributed-memory parallelism similarly to PGAS languages, shared-memory parallelism on both CPU and GPU, and also SIMD parallelism.
During the first phase of the work, the candidate will study the related bibliography (see an extract below), focusing on parallel programming models, and especially PGAS languages, to get a good understanding of the concepts involved, understand the limitations of current approaches, and assess the benefits of existing software. In addition to academic publications, this first phase will be used to discover existing work in ddc and the needs identified as part of the rewrite of the GYSELA code.
Then, the work will focus on the design of a programming model supporting the various features identified above. This work will start by focusing on single-node CPU and GPU parallelism before moving to multi-node distributed parallelism. The proposed concepts will be tested on academic cases using the usual evaluation criteria from the literature: portability, performance, lines of code, readability, etc.
Finally, the proposed solutions will be implemented in ddc and applied to the development of the next-generation production code GyselaX, in collaboration with the team developing the code at CEA/IRFM, Cadarache, and with collaborators working on performance optimization and portability at JAEA, Tokyo, Japan. They will be tested at scale on several supercomputers, including systems amongst the most powerful in the world (Fugaku, Joliot-Curie, Adastra, …), but also on testbeds of new GPU accelerators in the framework of collaborations with vendors such as Intel. All results will be submitted for publication to international peer-reviewed conferences and journals such as SuperComputing, IPDPS, Cluster, JPDC, etc. The candidate will also be encouraged to participate in writing a proposal to the C++ standards committee about the solutions found.
The successful candidate will demonstrate the following skills and qualities:
- strong interest in programming models in general and parallel programming models in particular,
- proficiency and experience with modern C++ and template metaprogramming,
- motivation for team-work in an international environment.
In addition, the following will be considered a plus:
- knowledge of and experience with GPUs, parallel and high-performance computing,
- interest in applied mathematics and numerical simulation,
- experience with software engineering and library design.