Bloom Filters for GroupBy ReduceBy and Join in Thrill
Thrill is a prototype of a high-performance general purpose big data processing framework. In the Reduce operation, which is similar to Reduce in MapReduce, Thrill performs an all-to-all hash based data shuffle of large amounts of data. Elements only occuring on a single worker could however be reduced locally without shuffling them. To find these elements, we propose a detection algorithm based on distributed single-shot Bloom filters to efficiently detect a large portion of these elements.
Additionally, we implemented an SQL-style InnerJoin operator for Thrill. For the InnerJoin operator it is not possible to perform local reduction before the hash based data shuffle. Therefore we implemented an augmented version of our detection algorithm, which detects the worker with the highest number of total occurences for each key. In order to reduce the amount of total network traffic, this worker is determined as the shuffle target for that key.
We use multiple algorithmic micro-benchmarks on the AWS EC2 computing cluster to benchmark the performance of our implementations in the Thrill framework.
In communication-heavy benchmarks, such as median computation and the TPC-H4 database join query, we can improve the running time by a factor of 1.5 up to 10 by using duplicate
detection. In the WordCount and PageRank algorithms performed on real world data we can lower the amount of network traffic while keeping a comparable running time.