
Throughput Optimization in Distributed Databases via Hypergraph Partitioning

Research topic: Hypergraph Partitioning

Sebastian Schlag, Tobias Heuer (SAP)

In the age of Big Data and cloud computing, databases are often implemented in a distributed fashion. In such a distributed database system, tables are partitioned by horizontal fragmentation, and the resulting fragments are assigned to the physical machines of a cluster. Incoming user queries are translated into execution plans and scheduled across the nodes that store the relevant data. Executing a query on several nodes often yields a nearly linear speedup, e.g. if the data partitions can be processed independently. However, when executing distributed algorithms or processing queries in parallel, the data distribution can have a significant impact on query execution time and thus on the throughput of the system.

This problem is known as the allocation problem: given a set of data items (e.g. records or horizontal partitions of database tables) and an expected query workload, the goal is to assign the data items to the physical machines of a cluster such that the total execution time of the workload is minimized [3], while satisfying a balance constraint (e.g. computational load balance or an even distribution of data items).

Recent approaches to solving the problem are based on a reduction to hypergraph partitioning [2]. A hypergraph is a generalization of a graph in which an edge can connect any number of nodes. The hypergraph partitioning problem asks for a partition of the nodes into k non-empty, disjoint blocks of roughly equal size that minimizes an objective function defined on the edges. A high-quality partition of the hypergraph can be interpreted as an assignment of data items to physical nodes that minimizes communication overhead and balances node utilization. How much the partition helps in practice depends heavily on a meaningful choice of node and edge weights.
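To make the reduction concrete, here is a minimal Python sketch under the common modeling assumptions: data items are hypergraph nodes, each query becomes a hyperedge over the items it touches (weighted by its frequency), and a k-way partition is scored with the (lambda − 1) connectivity metric, a standard objective that approximates communication volume. All names are illustrative, not part of any particular partitioner's API.

```python
def connectivity_objective(hyperedges, partition):
    """Sum over all edges of weight * (number of blocks the edge spans - 1).

    An edge contained in a single block costs nothing; an edge spanning
    several blocks models a query that must touch several machines.
    """
    total = 0
    for nodes, weight in hyperedges:
        blocks = {partition[v] for v in nodes}
        total += weight * (len(blocks) - 1)
    return total


def is_balanced(partition, node_weights, k, epsilon):
    """Check that every block's weight is at most (1 + epsilon) * average."""
    loads = [0.0] * k
    for v, block in partition.items():
        loads[block] += node_weights[v]
    limit = (1 + epsilon) * sum(node_weights.values()) / k
    return all(load <= limit for load in loads)


# Toy instance: 4 data items, 3 queries with frequencies as edge weights.
hyperedges = [({"a", "b"}, 5), ({"b", "c", "d"}, 2), ({"a", "d"}, 1)]
node_weights = {"a": 1, "b": 1, "c": 1, "d": 1}
partition = {"a": 0, "b": 0, "c": 1, "d": 1}  # k = 2 blocks

print(connectivity_objective(hyperedges, partition))  # -> 3
print(is_balanced(partition, node_weights, 2, epsilon=0.1))  # -> True
```

Here only the two lighter queries cross the block boundary (2·1 + 1·1 = 3), while the heaviest query stays local; a partitioner searches for exactly this kind of assignment.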

In this master's thesis we want to investigate the impact of such an approach on a real-world database system. During the thesis you will be part of the SAP Big Data Vora team and integrate your work directly into an existing distributed database system. Furthermore, we will cooperate with the Institute of Theoretical Informatics at KIT and employ the state-of-the-art hypergraph partitioner KaHyPar [1]. The high-level idea is to transform queries, enriched with statistical information about their execution, into a hypergraph model, and then periodically compute a repartitioning of the data items via hypergraph partitioning.
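The modeling step described above might look roughly as follows: each observed query contributes one hyperedge over the data items (e.g. table fragments) it accesses, and repeated queries raise that edge's weight. This is a hypothetical sketch with illustrative names, not the Vora or KaHyPar API.

```python
from collections import Counter


def workload_to_hypergraph(query_log):
    """Aggregate a query log into a weighted hypergraph.

    query_log: iterable of sets, each the data items one query accessed.
    Returns (nodes, hyperedges) where hyperedges is a list of
    (item set, frequency) pairs; identical queries are merged.
    """
    edge_counts = Counter(frozenset(items) for items in query_log)
    nodes = set().union(*edge_counts) if edge_counts else set()
    hyperedges = [(set(edge), count) for edge, count in edge_counts.items()]
    return nodes, hyperedges


# Toy log: two runs of a join on fragment pair 1, one on fragment pair 2.
log = [{"orders_1", "customers_1"},
       {"orders_1", "customers_1"},
       {"orders_2", "lineitem_2"}]
nodes, hyperedges = workload_to_hypergraph(log)
print(len(nodes))  # -> 4
```

The resulting hypergraph (here: one edge of weight 2, one of weight 1, over four fragments) would then be handed to the partitioner, and the computed blocks read back as a new fragment-to-machine assignment.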

[1] Yaroslav Akhremtsev, Tobias Heuer, Peter Sanders, and Sebastian Schlag. Engineering a direct k-way hypergraph partitioning algorithm. In 19th Workshop on Algorithm Engineering and Experiments (ALENEX 2017), 2017.
[2] Carlo Curino, Evan Jones, Yang Zhang, and Sam Madden. Schism: a workload-driven approach to database replication and partitioning. Proceedings of the VLDB Endowment, 3(1-2):48–57, 2010.
[3] K. Ashwin Kumar, Abdul Quamar, Amol Deshpande, and Samir Khuller. SWORD: workload-aware data placement and replica selection for cloud data management systems. The VLDB Journal, 23(6):845–870, 2014.