Sorting algorithms for distributed memory systems are essential for quickly sorting large amounts of data. The de facto standard for communication in High Performance Computing (HPC) is the Message Passing Interface (MPI). MPI uses the concept of communicators to connect groups of PEs. Recursive algorithms may split a group of PEs into subgroups and then execute the algorithm on these subgroups recursively. A straightforward approach to implementing such an algorithm is to create a MPI subcommunicator for each subgroup. However, the time to create a MPI communicator is not negligible on a large group of PEs and may therefore drastically increase the running time of the algorithm.
In this thesis, we present a communication library based on MPI that supports communicator creation in constant time and without communication. The library supports both point-to-point communication and non-blocking collective operations. We also present the first efficient implementation of Schizophrenic Quicksort, a recursive sorting algorithm for distributed memory systems that is based on Quicksort. Schizophrenic Quicksort guarantees perfect data balance with the drawback that some PEs have to work on two groups simultaneously. The only previous implementation scales bad on large supercomputers. We integrate our communication library into the implementation of Schizophrenic Quicksort and thus eliminate the cost for the communicator creation almost completely. We also avoid problems caused by blocking collective operations. We also implement a better pivot selection and reduce the amount of communication in each level of recursion.
Furthermore, we extend an implementation of Hypercube Quicksort with our communication library.
We perform an extensive experimental evaluation of our implementations. In our experiments, our library reduces the time to create a communicator by a factor of more than 10000 compared to MPI. In contrast, the collective operations of MPI outperformed the collective operations of our library for most inputs. On large numbers of PEs, splitting a communicator and executing the operation Broadcast 50 times (with medium inputs) with our library is still faster than a single communicator split with MPI. Our experimental results show that we improve the performance of Schizophrenic Quicksort by a factor of up to 40 and the performance of Hypercube Quicksort by a factor of up to 15 if we use our library instead of MPI. We compare our implementation of Schizophrenic Quicksort with the implementation of Hypercube Quicksort. The results indicate that Schizophrenic Quicksort is not able to outperform Hypercube Quicksort. However, Schizophrenic Quicksort runs on any number of PEs while Hypercube Quicksort only runs on 2^k PEs.