Data-Redistribution for Malleable Algorithms
This Master's Thesis focuses on the concept of malleability in algorithms parallelized with the Message Passing Interface (MPI) in the field of High Performance Computing (HPC).
Malleability allows algorithms to dynamically adapt to changes in the number of available processors, improving load balancing, ease of scalability, and resource utilization. It thus enhances the efficiency of large-scale computing systems by maximizing the use of available resources.
MPI is a popular standard for HPC which provides communication primitives for parallel algorithms. The upcoming MPI 5 standard is set to support malleability.
One of the main challenges in developing a malleable algorithm is the redistribution of static data after a resize. Keeping all the data at each processor is not always feasible or efficient, especially in large-scale computing environments where the amount of data can be massive. Sharding the data across the processes decreases the memory overhead, network transfer times, and improves the scaleability of the algorithm.
ReStore is a C++ library designed to recover data after process failures in a fault-tolerant distributed application. It allows the user to define an arbitrary mapping of data to processes and is optimized for rapid and scalable recovery of data.
The ability of ReStore to recover data after a failure can also be applied to the redistribution of data after the set of processes has shrunk or expanded.
The goal of this thesis is to implement a malleable algorithm and use ReStore as a tool to redistribute its static data. Using the lessons learned in this implementation the goal is to develop an efficient and scalable general purpose framework for redistributing data in a malleable setting and evaluate its performance.
• Interest in parallel algorithms and data structures
• Excellent knowledge of modern C++ or Rust
• Experience in implementing distributed algorithms
• Willingness to engage in meticulous scientific work