Massive data sets arise naturally in many domains: geographic information systems, computer graphics, database systems, telecommunication billing systems [20], network analysis [23], and scientific computing [28]. Applications in these domains have to process terabytes of data. However, the internal memory of a computer can hold only a small fraction of such huge data sets, so during processing the applications need to access external storage (e.g. hard disks). One such access can be about 10^6 times slower than a main memory access, but once the disk head is positioned, accessing the subsequent elements in external memory is much cheaper. In order to amortize the high cost of a random access, one reads or writes contiguous chunks of size B. Since I/O thus becomes the main bottleneck for applications dealing with large data sets, one tries to minimize the number of I/O operations performed. To increase I/O bandwidth, applications use multiple disks in parallel: in each I/O step the algorithms try to transfer D blocks between main memory and the D disks (one block from each disk). This model was formalized by Vitter and Shriver as the Parallel Disk Model (PDM) [38] and is the standard theoretical model for designing and analyzing I/O-efficient algorithms.

Theoretically, I/O-efficient algorithms and data structures have been developed for many problem domains: graph algorithms, string processing, computational geometry, etc. (for a survey see [27]). Some of them have been implemented: sorting, matrix multiplication [36], search trees [8,30,4,1], priority queues [7], text processing [10]. However, there is an ever-increasing gap between the theoretical achievements of external memory algorithms and their practical usage. Several external memory software library projects (LEDA-SM [11] and TPIE [22]) have been started to reduce this gap. They offer frameworks that aim to speed up the implementation of I/O-efficient algorithms by abstracting away the details of how I/O is performed.

The TPIE and LEDA-SM projects are excellent proofs of concept of the external memory paradigm, but they have some drawbacks that impede their practical usage. This led us to start the development of STXXL, a performance-oriented library of external memory algorithms and data structures that tries to avoid these obstacles. The following are some key features of STXXL:

- Transparent support of parallel disks. The library provides implementations of basic *parallel* disk algorithms. STXXL is the only external memory algorithm library supporting parallel disks; such a feature was announced for TPIE in [35,22].
- The ability to handle problems of *very large* size (up to dozens of terabytes).
- Improved utilization of computer resources. STXXL explicitly supports *overlapping* of I/O and computation, and its implementations of external memory algorithms and data structures benefit from this overlapping.
- Small constant factors in I/O volume. A unique library feature, *"pipelining"*, can save more than *half* of the I/Os performed by many algorithms.
- Shorter *development times* due to well-known STL-compatible interfaces for external memory algorithms and data structures. The STL (Standard Template Library) [34] is a library of algorithms and data structures that is part of the C++ standard. STL algorithms can be directly applied to STXXL containers (code reuse); moreover, the I/O complexity of the algorithms remains optimal in most cases.

The STXXL library is open source and available under the
Boost Software License 1.0 (`http://www.boost.org/LICENSE_1_0.txt`).
The latest version of the library, a user tutorial, and programmer
documentation can be downloaded at
`http://stxxl.sourceforge.net`. Currently the
size of the library is about 15 000 lines of code.

The remainder of this paper is organized as follows. Section 2 discusses the design of STXXL: we explain how the generic interfaces of the library support high-performance external memory computation using parallel disks, overlapping of I/O and computation, and pipelining, and we provide several examples. In Section 3 we implement a short benchmark that computes a maximal independent set of a graph and use it to study the performance of STXXL. Section 4 gives a short overview of projects using STXXL. We make some concluding remarks and point out directions for future work in Section 5.

A shortened version of this paper is also published as [14].