An Algorithm Library for Massive Data Sets
Summer 03
We are working on a a software library STXXL which allows processing
of huge data sets that do not fit into RAM memory but only on
hard disks. The starting point of this library is the C++ standard
template library STL. We want to implement a large part of its functionality
to work also for huge inputs.
Prerequisites
A course in algorithms and data structures.
A working knowledge of C++. You are encouraged to take
the lecture "Algorithm Library Design" at the same time.
The Fopra project can be used as the programming project in the lecture.
Further Information
- Privatdozent Dr. Peter Sanders, Building 46, R.308,
sanders@mpi-sb.mpg.de
Roman Dementiev, Building 46, R.322,
email
Possible Projects
The following list gives very preliminary ideas for possible projects.
You can work on these projects in groups of one to three people.
The implementations can range from prototypical to production quality
(well written, documented, tested, and tuned)
depending on difficulty and number of people involved.
There may also be possibilities to continue projects as master's projects.
The order of the list roughly follows our current priorities for
completing the library.
- YOUR IDEA HERE?
- Implement simple STL containers (queues, vectors,...), iterators, and
algorithms (find, for_each,...)
- How should simple STL components be refined? For example giving
hints for the kind of accesses in the future so that the library can do prefetching.
- B-Trees: THE external memory search tree data structure.
Can be used to implement the STL sorted container classes (multi)map and (multi)set
- Priority Queues.
- Sorting Strings This
is algorithmically interesting because it is not quite obvious how
to achieve time O(N + n log n) for sorting n strings of total length N.
It is also interesting as an example of an algorithm that works with variable length
items. How can our library elegantly support that?
- Suffix Array Construction.
The key to many string applications: Compression using Burrows-Wheeler transform,
full text indexing, various Bioinformatics applications. Can our library
support efficient implementation of this nontrivial algorithm using
its built in primitives for sorting etc.?
- An elegent implementation of pipelining: Often the output of one
operation on a sequence of elements (filtering, transforming, sorting,...)
is the input for other operations. We can save I/Os by offering a mechanism
similar to the unix pipe ("|") operator. How can this be implemented elegantly
in an STL-like library?
- Implement a class Relation that supports the typical operation of
a relational database (select, join, project).
- External memory numerics: Sparse matrices, solving partial differential equations,...
- External memory minimum spanning trees
- External memory graph partitioning for planar graphs
- Geometric data structures and algorithms
- Frequent Item Set Data Mining