An Algorithm Library for Massive Data Sets

Summer 03

[Chaos] We are working on a a software library STXXL which allows processing of huge data sets that do not fit into RAM memory but only on hard disks. The starting point of this library is the C++ standard template library STL. We want to implement a large part of its functionality to work also for huge inputs.

Prerequisites

A course in algorithms and data structures. A working knowledge of C++. You are encouraged to take the lecture "Algorithm Library Design" at the same time. The Fopra project can be used as the programming project in the lecture.

Further Information

Privatdozent Dr. Peter Sanders, Building 46, R.308, sanders@mpi-sb.mpg.de
Roman Dementiev, Building 46, R.322, email

Possible Projects

The following list gives very preliminary ideas for possible projects. You can work on these projects in groups of one to three people. The implementations can range from prototypical to production quality (well written, documented, tested, and tuned) depending on difficulty and number of people involved. There may also be possibilities to continue projects as master's projects. The order of the list roughly follows our current priorities for completing the library.

YOUR IDEA HERE?
Implement simple STL containers (queues, vectors,...), iterators, and algorithms (find, for_each,...)
How should simple STL components be refined? For example giving hints for the kind of accesses in the future so that the library can do prefetching.
B-Trees: THE external memory search tree data structure. Can be used to implement the STL sorted container classes (multi)map and (multi)set
Priority Queues.
Sorting Strings This is algorithmically interesting because it is not quite obvious how to achieve time O(N + n log n) for sorting n strings of total length N. It is also interesting as an example of an algorithm that works with variable length items. How can our library elegantly support that?
Suffix Array Construction. The key to many string applications: Compression using Burrows-Wheeler transform, full text indexing, various Bioinformatics applications. Can our library support efficient implementation of this nontrivial algorithm using its built in primitives for sorting etc.?
An elegent implementation of pipelining: Often the output of one operation on a sequence of elements (filtering, transforming, sorting,...) is the input for other operations. We can save I/Os by offering a mechanism similar to the unix pipe ("|") operator. How can this be implemented elegantly in an STL-like library?
Implement a class Relation that supports the typical operation of a relational database (select, join, project).
External memory numerics: Sparse matrices, solving partial differential equations,...
External memory minimum spanning trees
External memory graph partitioning for planar graphs
Geometric data structures and algorithms
Frequent Item Set Data Mining

An Algorithm Library for Massive Data Sets Summer 03

Prerequisites

Further Information

Possible Projects

An Algorithm Library for Massive Data Sets

Summer 03