Scalable String and Suffix Sorting: Algorithms, Techniques, and Tools

Author(s):
Timo Bingmann

Source:
Dissertation (2017)
Institute of Theoretical Informatics

Date:
July 2018

Abstract
This dissertation focuses on two fundamental sorting problems: string sorting and suffix sorting. The first part considers parallel string sorting on shared-memory multicore machines, the second part external memory suffix sorting using the induced sorting principle, and the third part distributed external memory suffix sorting with a new distributed algorithmic big data framework named Thrill.
Sorting strings or vectors is a basic algorithmic challenge different from integer sorting because it is important to access components of the keys to avoid repeated operations on the entire string. We focus on sorting large inputs which fit into the RAM of a shared-memory machine. String sorting is needed for instance in database index construction, suffix sorting algorithms, and to order high-dimensional geometric data.
We first survey engineered variants of basic sequential string sorting algorithms and perform an extensive experimental evaluation to measure their performance. Furthermore, we perform experiments to quantify parallel memory bandwidth and latency as preliminary work for designing parallel string sorting algorithms.
We then propose string sample sort as an adaptation of sample sort to string objects and present its engineered version Super Scalar String Sample Sort. This parallel-ready algorithm runs in O(D/w + n log n) expected time, makes effective use of the cache hierarchy, uses word- and instruction-level parallelism, and avoids branch mispredictions. Our parallelization named Parallel Super Scalar String Sample Sort (pS^{5}) employs voluntary work sharing for load balancing and is the overall best performing algorithm on single-socket multicore machines in our experiments.
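The distribution idea behind string sample sort can be illustrated with a minimal sequential sketch. It is not the engineered Super Scalar String Sample Sort: the function name, the window width `w`, the number of splitters, and the small-input cutoff are assumptions chosen for illustration, and splitters are located with a plain binary search. Strings are classified by the next `w` characters into "less than" and "equal to" buckets around sampled splitters, and only the "equal" buckets advance the depth.

```python
import bisect
import random


def string_sample_sort(strings, depth=0, w=4, num_splitters=3, rng=None):
    """Sort strings that all share their first `depth` characters."""
    rng = rng or random.Random(42)
    if len(strings) <= 4:
        return sorted(strings)  # small inputs: fall back to the built-in sort

    def key(s):
        # The next w characters after the common prefix (shorter if s ends).
        return s[depth:depth + w]

    # Draw a sample and pick equidistant distinct keys as splitters.
    sample = rng.sample(strings, min(len(strings), 4 * num_splitters))
    candidates = sorted({key(s) for s in sample})
    step = max(1, len(candidates) // num_splitters)
    splitters = candidates[::step][:num_splitters]

    # Bucket 2*i holds keys strictly below splitter i (and above splitter i-1),
    # bucket 2*i+1 holds keys equal to splitter i.
    buckets = [[] for _ in range(2 * len(splitters) + 1)]
    for s in strings:
        k = key(s)
        i = bisect.bisect_left(splitters, k)
        if i < len(splitters) and splitters[i] == k:
            buckets[2 * i + 1].append(s)
        else:
            buckets[2 * i].append(s)

    out = []
    for b, bucket in enumerate(buckets):
        if b % 2 == 1:
            # Equal bucket: w more characters are known to be common.
            # If the splitter is shorter than w, the strings ended and are identical.
            if len(splitters[b // 2]) == w:
                bucket = string_sample_sort(bucket, depth + w, w, num_splitters, rng)
        else:
            # Less-than bucket: same depth, but strictly fewer strings.
            bucket = string_sample_sort(bucket, depth, w, num_splitters, rng)
        out.extend(bucket)
    return out
```

Because characters inside an "equal" bucket are never compared again, the work spent on common prefixes stays proportional to their length, which is the intuition behind the D/w term in the running time.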
For platforms with non-uniform memory access (NUMA) we propose to run pS^{5} on each NUMA node independently and then merge the sorted string sequences. To accelerate the merge with longest common prefix (LCP) values we present a new LCP-aware multiway merge algorithm using a tournament tree. The merge algorithm is also used to construct a stand-alone LCP-aware K-way mergesort, which runs in O(D + n log n + n/K) time and benefits from long common prefixes in the input.
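To make the role of LCP values in merging concrete, here is a minimal two-way sketch. The dissertation's algorithm is a multiway merge with a tournament tree; this binary version only illustrates the comparison rule, and the function names are assumptions. Each head string carries its LCP with the last string written to the output. If the two values differ, the head with the larger LCP is smaller and no characters need to be inspected; if they are equal, comparison resumes after the known common prefix.

```python
def lcp(a, b, start=0):
    """Longest common prefix length of a and b, skipping `start` known-equal characters."""
    i = start
    limit = min(len(a), len(b))
    while i < limit and a[i] == b[i]:
        i += 1
    return i


def lcp_array(sorted_strings):
    """LCP of each string with its predecessor; the first entry is 0."""
    return [0] + [lcp(x, y) for x, y in zip(sorted_strings, sorted_strings[1:])]


def lcp_merge(a, lcp_a, b, lcp_b):
    """Merge two sorted string lists and return (merged, merged_lcp_array)."""
    out, out_lcp = [], []
    i = j = 0
    ha = hb = 0  # LCP of the current heads a[i], b[j] with the last output string
    while i < len(a) and j < len(b):
        if ha != hb:
            take_a = ha > hb      # larger LCP with the last output means smaller string
            h = min(ha, hb)       # LCP of the two heads, known without comparing
        else:
            h = lcp(a[i], b[j], ha)  # compare only beyond the shared prefix
            take_a = a[i][h:h + 1] <= b[j][h:h + 1]
        if take_a:
            out.append(a[i])
            out_lcp.append(ha)
            hb = h                # the loser's LCP is now relative to the new output
            i += 1
            ha = lcp_a[i] if i < len(a) else 0
        else:
            out.append(b[j])
            out_lcp.append(hb)
            ha = h
            j += 1
            hb = lcp_b[j] if j < len(b) else 0
    while i < len(a):             # copy the remainder of a
        out.append(a[i])
        out_lcp.append(ha)
        i += 1
        ha = lcp_a[i] if i < len(a) else 0
    while j < len(b):             # copy the remainder of b
        out.append(b[j])
        out_lcp.append(hb)
        j += 1
        hb = lcp_b[j] if j < len(b) else 0
    return out, out_lcp
```

The merged LCP array comes out as a by-product, so the result can be fed into a further merge level without recomputation.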
Broadly speaking, we propose both multiway distribution-based string sorting with string sample sort and multiway merge-based string sorting with LCP-aware merge and mergesort, and we engineer and parallelize both approaches. We also present parallelizations of multikey quicksort and radix sort, and perform an extensive experimental evaluation using six machines and seven inputs. For all input instances, except random strings and URLs, pS^{5} achieves higher speedups on modern single-socket multicore machines than our own parallel multikey quicksort and radix sort implementations, which are already better than any previous ones. On multi-socket NUMA machines pS^{5} combined with the LCP-aware top-level multiway merging was fastest on most inputs.
We then turn our focus to suffix sorting, which is equivalent to suffix array construction. The suffix array is one of the most popular text indexes and can be used for fast substring search in DNA or text corpora, in compression applications, and is the basis for many string algorithms. When augmented with the LCP array and additional tables, the suffix array can emulate the suffix tree in a myriad of stringology algorithms. Our goal is to create fast and scalable suffix sorting algorithms to generate large suffix arrays for real-world inputs. As an introduction to suffix array construction, we first present a brief survey of its principles and history.
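As a small illustration of what a suffix array is and how it supports substring search, the following sketch builds one naively and answers queries by binary search. The function names are assumptions, and the construction simply sorts all suffixes, which is far too slow for the large inputs targeted here; it only serves to define the output.

```python
def build_suffix_array(text):
    """Naive construction: sort the suffix start positions by the suffixes themselves."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return all start positions of `pattern` in `text`, in increasing order."""
    m = len(pattern)

    def prefix(k):
        # The first m characters of the k-th smallest suffix.
        return text[sa[k]:sa[k] + m]

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


sa = build_suffix_array("banana")
print(sa)                                     # [5, 3, 1, 0, 4, 2]
print(find_occurrences("banana", sa, "ana"))  # [1, 3]
```

All suffixes that start with the pattern form one contiguous range of the suffix array, which is why two binary searches suffice.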
Our initial contribution to this field is eSAIS, the first external memory suffix sorting algorithm which uses the induced sorting principle. Its central loop is an elegant reformulation of this principle using an external memory priority queue, and our theoretical analysis shows that eSAIS requires at most Sort(17 n) + Scan(9 n) I/O volume. We then extend eSAIS to also construct the LCP array while suffix sorting, which yields the first implementation of fully external memory suffix and LCP array construction in the literature. Our experiments demonstrate that eSAIS is a factor of two faster than DC3, the previously best external memory suffix sorting implementation. After our initial publication of eSAIS, many authors showed interest in the topic and we review their contributions and improvements over eSAIS.
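The induced sorting principle itself can be sketched in memory as follows. This is not eSAIS: it uses plain arrays instead of an external memory priority queue, and it sorts the LMS suffixes naively where a real induced sorting algorithm would recurse on a reduced string. The function name is an assumption. The sketch shows the core idea: once the LMS suffixes are in order, one left-to-right scan induces the order of all L-type suffixes and one right-to-left scan induces the order of all S-type suffixes.

```python
def induced_suffix_array(text):
    """Suffix array of `text` via induced sorting from naively sorted LMS suffixes."""
    s = [ord(c) + 1 for c in text] + [0]  # append a unique smallest sentinel
    n = len(s)

    # Classify suffixes: S-type if smaller than the next suffix, else L-type.
    is_s = [False] * n
    is_s[n - 1] = True
    for i in range(n - 2, -1, -1):
        is_s[i] = s[i] < s[i + 1] or (s[i] == s[i + 1] and is_s[i + 1])

    # LMS positions: S-type suffixes preceded by an L-type suffix.
    lms = [i for i in range(1, n) if is_s[i] and not is_s[i - 1]]
    lms.sort(key=lambda i: s[i:])  # assumption: naive sort instead of recursion

    # Bucket boundaries per first character.
    counts = {}
    for c in s:
        counts[c] = counts.get(c, 0) + 1
    heads, tails, pos = {}, {}, 0
    for c in sorted(counts):
        heads[c] = pos
        pos += counts[c]
        tails[c] = pos - 1

    # Seed: place sorted LMS suffixes at the tails of their buckets.
    sa = [-1] * n
    tail = dict(tails)
    for i in reversed(lms):
        sa[tail[s[i]]] = i
        tail[s[i]] -= 1

    # Induce L-type suffixes with a left-to-right scan.
    head = dict(heads)
    for k in range(n):
        j = sa[k] - 1
        if sa[k] > 0 and not is_s[j]:
            sa[head[s[j]]] = j
            head[s[j]] += 1

    # Induce S-type suffixes with a right-to-left scan (overwrites the seeds).
    tail = dict(tails)
    for k in range(n - 1, -1, -1):
        j = sa[k] - 1
        if sa[k] > 0 and is_s[j]:
            sa[tail[s[j]]] = j
            tail[s[j]] -= 1

    return sa[1:]  # drop the sentinel suffix
```

Each scan only reads the array at the current position and writes into bucket heads or tails, which is the access pattern that a priority queue can replace when the data no longer fits into RAM.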
For scaling to even larger inputs, we then consider suffix sorting on a distributed cluster machine. To harness the computational power of such a system in a convenient dataflow-style functional programming paradigm, we propose the new high-performance distributed big data processing framework Thrill. Thrill's central concept is a distributed immutable array (DIA), which is a virtual array of C++ objects distributed onto the cluster. Such arrays can be manipulated using a small set of scalable primitives, such as mapping, reducing, and sorting. These are implemented using pipelined distributed external memory algorithms encapsulated as C++ template classes, which can be efficiently coupled to form large complex applications. Our Thrill prototype is evaluated using five microbenchmarks against the popular frameworks Apache Spark and Flink on up to 16 hosts in the AWS Elastic Compute Cloud. Thrill consistently outperforms the other frameworks in all benchmarks and on all numbers of hosts.
Using Thrill we then implement five suffix sorting algorithms as a case study. Three are based on prefix doubling and two are variants of the linear-time difference cover algorithm DC. The implementation of these complex algorithms demonstrates the expressiveness of the scalable primitives provided by Thrill. They are also the first distributed external memory suffix sorters presented in the literature. We compare them experimentally against two hand-coded MPI implementations and the fastest non-distributed sequential suffix sorters. Our results show that algorithms implemented using Thrill are competitive with MPI programs, but scale to larger inputs due to automatic usage of external memory. In the future, these implementations can benefit from improvements of Thrill such as fault tolerance or specialized sorting algorithms.
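For readers unfamiliar with prefix doubling, a minimal sequential sketch is given below. It does not use Thrill and is not one of the five implementations; the function name is an assumption. In each round, suffixes are sorted by a pair of ranks that together describe their first 2h characters, new ranks are assigned, and h doubles until all ranks are distinct.

```python
def prefix_doubling_suffix_array(text):
    """Suffix array by prefix doubling: O(log n) rounds of sorting rank pairs."""
    n = len(text)
    if n == 0:
        return []
    rank = [ord(c) for c in text]  # round 0: rank by the first character
    sa = list(range(n))
    h = 1
    while True:
        def key(i):
            # Rank of the first h characters, then rank of the following h characters.
            # Suffixes shorter than h get -1, which sorts before every real rank.
            return (rank[i], rank[i + h] if i + h < n else -1)

        sa.sort(key=key)

        # Assign new ranks: equal keys share a rank, a differing key gets the next one.
        new_rank = [0] * n
        for prev, cur in zip(sa, sa[1:]):
            new_rank[cur] = new_rank[prev] + (key(prev) != key(cur))
        rank = new_rank

        if rank[sa[-1]] == n - 1:  # all ranks distinct: the order is final
            return sa
        h *= 2
```

Each round consists only of sorting and a scan over neighbouring items, the kind of mapping, reducing, and sorting steps that a dataflow framework exposes as primitives.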