JRA1 - Activity Description
1. Motivation and objectives
Most of the supercomputing architectures available today, and those that will be deployed in the European High Performance Computing centres in the coming years, share the same macro-architecture, with the following elements: a cluster of nodes connected through an interconnection network; nodes with multiple chips, usually sharing some amount of memory; and chips with a (large) number of cores that partially share the lower levels of the memory hierarchy and can exploit SIMD parallelism through specialized or general-purpose accelerators. These supercomputers vary radically in the number and type of these elements; however, in order to sustain the performance growth required by end users, all of these systems will rely on increasing the total number of computing elements, which can reach tens of thousands or even hundreds of thousands of cores.
For the optimal exploitation of such powerful systems, scientific applications will need to be properly adapted. In general, this can require the adoption of new programming strategies and, consequently, code (re)writing or even reengineering. Novel programming models are required that let the programmer express algorithms and ideas in an abstract way, as far removed as possible from the actual hardware described above. At the very least, they should promote the isolation of target-architecture specificities in clearly encapsulated modules, while the global algorithmic and concurrency specification is expressed in a general and portable way.
In this JRA, participants gained experience with programming models for massively parallel architectures with complex nodes (i.e. nodes including multi-core chips and, possibly, accelerators). On the one hand, our activities focussed on the productive expression of parallelism using PGAS languages (UPC and CAF), in contrast to the well-established MPI standard and its hybrid use with OpenMP. On the other hand, they addressed novel proposals for the expression of parallelism on innovative node architectures that include hardware accelerators (e.g. Cell, GPU, …). The JRA also targeted the integration of OmpSs, a novel task-based programming model that offers a shared address space and favours asynchrony, with new low-level accelerator paradigms such as CUDA or OpenCL, and its hybrid use with MPI. The porting and optimization of several applications using these programming models has allowed the consortium to collect recommendations and best practices for future parallel programming at the HPC-E2 sites. The work was done with the two other objectives of this JRA in mind:
* To influence the evolution of the standards related to parallel programming, in particular OpenMP and MPI.
* To provide innovative solutions to efficiently perform input/output (I/O) operations in HPC applications, exploiting the underlying hardware capabilities. An open-source library of functions was defined, developed and tested to help scientific code developers address problems and bottlenecks related to I/O (reading and/or writing files); a minimal sketch of the kind of parallel I/O pattern such a library builds on is shown below.
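As background for this objective, the following sketch is illustrative only and does not show the library's actual interface: it uses the underlying MPI I/O calls directly, with every MPI rank writing its own contiguous slice of a shared binary file through a collective call. The file name, slice size and data layout are assumptions made for the example.

    /* Illustrative sketch only (not the library's interface): each MPI rank
     * writes its contiguous slice of a shared binary file with a collective
     * MPI I/O call, so the MPI library can aggregate requests for the
     * parallel file system. File name and slice size are arbitrary. */
    #include <mpi.h>

    #define LOCAL_COUNT 4096          /* doubles written per rank (illustrative) */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double buf[LOCAL_COUNT];
        for (int i = 0; i < LOCAL_COUNT; ++i)
            buf[i] = rank + 0.001 * i;                /* fill the local slice */

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "output.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* every rank writes at its own offset in the common file; the
         * collective variant lets MPI coordinate and aggregate the requests */
        MPI_Offset offset = (MPI_Offset)rank * LOCAL_COUNT * sizeof(double);
        MPI_File_write_at_all(fh, offset, buf, LOCAL_COUNT, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }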
2. Summary of activities and results
During the first year of this JRA, the consortium mainly focussed on studying and deploying novel programming models and on starting to port some kernels and applications to them, with the aim of supporting, guiding and enabling scientific program developers and researchers in exploiting the latest computing architectures available in the HPC-E2 computing centres. We explored the use of hybrid MPI/OpenMP (using a multi-objective genetic algorithm, Evolve-WDN from CINECA) and the use of numerical libraries (e.g. ScaLAPACK) as an alternative to it. Regarding PGAS languages, we focussed our attention on UPC and CAF, comparing their programmability and performance with the well-established MPI standard using some benchmarks (Euroben and a Lattice Boltzmann kernel from HLRS). In our opinion, UPC provides a straightforward programming model once the initial hurdles and language limitations are understood. At that time, the implementation of CAF was less mature and still needed to provide the functionality necessary to run real applications on large-scale parallel architectures. We also paid specific attention to proposals oriented towards increasing parallel programming productivity and targeting the innovative devices that are (and will be) available in modern (and future) heterogeneous massively parallel architectures, concentrating mainly on novel proposals for the expression of parallelism on GPUs. Two application kernels (crunch and fiddle, in the crystallographic area, from SARA) were ported to CUDA, observing large speedups compared to current multi-core architectures. BSC also defined OmpSs, released a first prototype implementation and evaluated its suitability and its hybrid use with MPI on some kernel benchmarks. Regarding I/O, CINECA delivered a first version of a library of functions designed to implement easily and efficiently I/O utilities to read and write files on multiprocessor HPC architectures. The library relies on MPI I/O and is distributed as open source. It was compiled and tested on different UNIX platforms and achieved optimal efficiency in configurations with tens to hundreds of processors.
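To make the hybrid approach concrete, the sketch below shows the basic MPI/OpenMP pattern explored during this year, under illustrative assumptions (the problem size and data distribution are arbitrary and unrelated to Evolve-WDN): MPI ranks divide the problem across nodes, OpenMP threads share the work inside each rank, and a collective reduction combines the partial results.

    /* Minimal hybrid MPI/OpenMP sketch (illustrative only): OpenMP threads
     * reduce rank-local data, MPI combines the per-rank partial sums. */
    #include <mpi.h>
    #include <stdio.h>

    #define N_LOCAL 1000000L          /* elements owned by each rank (illustrative) */

    int main(int argc, char **argv)
    {
        int provided, rank;

        /* MPI_THREAD_FUNNELED is sufficient: only the master thread calls MPI */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* node-level parallelism: threads share the rank-local loop */
        double local = 0.0;
        #pragma omp parallel for reduction(+ : local)
        for (long i = 0; i < N_LOCAL; ++i)
            local += 1.0 / (1.0 + (double)i + (double)rank * N_LOCAL);

        /* cluster-level parallelism: combine the partial sums across ranks */
        double global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum = %.6f\n", global);

        MPI_Finalize();
        return 0;
    }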
During the second year of this JRA, the consortium focussed its activities on gaining more experience with hybrid MPI/OpenMP programming using production applications, and with programming models targeting innovative architectural solutions in the HPC area (mainly GPUs). Four applications were optimized using MPI and/or OpenMP (Splotch from CINECA, Crank from SARA, GPAW from CSC and 3D Lattice Boltzmann from HLRS). For GPUs, two applications/kernels from different domains were optimized using CUDA or OpenCL: Splotch from CINECA and Crank from SARA. UEDIN ported two additional applications (Smoothed Particle Hydrodynamics, and Evolutionary Tree Extraction and MCMC for Pathogen Dynamics) using the PGI directives and compiler, comparing their programmability and performance with CUDA/OpenCL. BSC's implementation of StarSs was evolved, and its hybrid use with MPI and CUDA was explored and evaluated using several benchmarks (HPL, matrix multiply, Jacobi, FFT, conjugate gradient, …). CINECA's I/O library was also improved in a second release, with additional functionalities and support for very large parallel systems (up to tens of thousands of processors).
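To give a flavour of the task-based, data-flow style evaluated in this work, the sketch below expresses a blocked matrix multiply (one of the benchmarks listed above) as a graph of tasks. It uses standard OpenMP task dependencies as an approximation of the OmpSs/StarSs in/out annotations rather than OmpSs syntax itself; the block count, block size and data layout are illustrative assumptions.

    /* Blocked matrix multiply written as a task graph (illustrative): every
     * tile update is a task, and the depend clauses let the runtime run
     * independent tiles concurrently while serialising updates to the same
     * C tile. Standard OpenMP tasks stand in for the OmpSs/StarSs in/out
     * annotations; A and B are assumed to be filled elsewhere. */
    #define NB 8                      /* blocks per matrix dimension (illustrative) */
    #define BS 128                    /* block edge size (illustrative)             */

    /* matrices stored as NB x NB grids of contiguous BS*BS tiles */
    static double A[NB][NB][BS * BS], B[NB][NB][BS * BS], C[NB][NB][BS * BS];

    /* sequential kernel for one tile: C_tile += A_tile * B_tile */
    static void gemm_tile(const double *a, const double *b, double *c)
    {
        for (int i = 0; i < BS; ++i)
            for (int k = 0; k < BS; ++k)
                for (int j = 0; j < BS; ++j)
                    c[i * BS + j] += a[i * BS + k] * b[k * BS + j];
    }

    void blocked_gemm(void)
    {
        #pragma omp parallel
        #pragma omp single
        for (int i = 0; i < NB; ++i)
            for (int j = 0; j < NB; ++j)
                for (int k = 0; k < NB; ++k) {
                    /* one task per tile update; dependencies follow the data */
                    #pragma omp task \
                        depend(in:    A[i][k][0:BS * BS], B[k][j][0:BS * BS]) \
                        depend(inout: C[i][j][0:BS * BS])
                    gemm_tile(A[i][k], B[k][j], C[i][j]);
                }
        /* implicit barrier at the end of the parallel region waits for all tasks */
    }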
During these two years, BSC and UEDIN have participated in different subcommittees of the OpenMP Language Committee working towards versions 3.1 and 4.0 of the standard. The work on version 3.1 mainly focussed on user-defined reductions and on extensions to the atomic and tasking constructs. For version 4.0, our work focussed on the development of an error model, mechanisms to specify thread-to-core mappings, support for accelerator devices, and enhancements to OpenMP tasking, including task reductions and dependencies between tasks. The interaction has been maintained via weekly teleconferences and two to three face-to-face meetings per year. HLRS, as a member of the MPI standardization body, has participated in a number of face-to-face meetings during this JRA, taking responsibility for several new corrections and enhancements and making major contributions in the area of scalability to very high numbers of nodes.
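As an illustration of one of the features worked on in these subcommittees, the sketch below uses a user-defined reduction, in the declare reduction form that eventually entered the standard, to combine per-thread partial (sum, count) pairs. The struct, reduction identifier and input data are illustrative assumptions.

    /* Illustrative OpenMP user-defined reduction: each thread accumulates a
     * partial (sum, count) pair and the declared combiner merges them. */
    #include <stdio.h>

    typedef struct { double sum; long count; } acc_t;

    #pragma omp declare reduction(accplus : acc_t :                          \
            omp_out.sum += omp_in.sum, omp_out.count += omp_in.count)        \
        initializer(omp_priv = { 0.0, 0 })

    int main(void)
    {
        enum { N = 100000 };
        static double data[N];
        for (int i = 0; i < N; ++i)
            data[i] = 0.5 * i;                    /* illustrative input */

        acc_t a = { 0.0, 0 };
        #pragma omp parallel for reduction(accplus : a)
        for (int i = 0; i < N; ++i) {
            a.sum   += data[i];
            a.count += 1;
        }

        printf("mean = %f over %ld values\n", a.sum / a.count, a.count);
        return 0;
    }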