MCSTL: The Multi-Core Standard Template Library


Doxygen-generated documentation

The library is under continuous development, so please report all bugs and performance deficiencies immediately. Sorting currently performs best, as do embarrassingly parallel operations like for_each. One algorithm not available in the (original) STL is multiway merging. However, it is very important for both parallel and external sorting, and is therefore included in the MCSTL.

Platform Requirements

MCSTL should compile successfully with all compilers that support OpenMP version 2.5. However, the GCC implementation of the C++ standard library (libstdc++) must be used. We have tested up to GCC 4.2.2 and the Intel C++ Compiler 10.1.011 (see the Intel compiler particularities). So far, MCSTL has only been tested on Linux and Solaris. Tests with the Microsoft compiler have not yet been successful.

Changes to Your Application

You do not have to change any source code in your application. However, to benefit from the speedup, you should use STL algorithms like

sort, random_shuffle, partition, merge, find, nth_element, partial_sum, and for_each

and friends as often as possible.
The only thing you have to do to get parallelism incorporated is to add two additional include directories, e.g. by prepending


to the command line. The alternative headers will be superimposed transparently, so no changes are made to the original headers. OpenMP must be supported and switched on, e.g. by adding

-fopenmp

to the command line for GCC, or

-openmp

for the Intel compiler.

Tuning and Customization

For using any of the tuning options, you must

#include <mcstl.h>

first. To specify the number of threads to be used, set the


variable accordingly. To force a function to execute sequentially, add


to the end of the argument list. On the other hand, you can force some functions to be executed in parallel, and also choose whether load balancing will be applied (recommended for dynamic workloads).
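The MCSTL-specific settings variable is elided above. Independently of it, the standard OpenMP environment variable OMP_NUM_THREADS also caps the number of threads used, e.g.:

```shell
# Standard OpenMP mechanism to limit the thread count for an OpenMP program;
# this works in addition to any MCSTL-specific setting.
export OMP_NUM_THREADS=4
```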

Parallelism always incurs some overhead. Thus, it is not helpful to parallelize operations on very small data sets. MCSTL provides means to avoid parallelizing work that is not worth it. For each algorithm, a minimum problem size can be stated, usually using the variable


Please see mcstl.h for details.

Getting Started Example

Compile the file parsort.cpp using:

g++-4.2.1 -O3 -fopenmp -I$MCSTL_ROOT/c++ -o parsort parsort.cpp

This assumes that your GCC 4.2.1 is installed in



The MCSTL is expected in

$MCSTL_ROOT
For reliable performance measurements, it is of course mandatory to switch on optimization. Since atomic operations are used that are not supported by older processors, you might have to specify


on 32-bit Intel architectures to get the code compiled. This, of course, means that the program will only run on Pentium-II-class or better computers, which should not be a severe limitation.

Using MCSTL and STXXL in Combination

MCSTL performs particularly well in combination with the STXXL. Just compile the STXXL with the additional make targets of the latest STXXL release, and compile your program with the options described above. Below, experimental results for sorting are shown. A computer with 4 processors and 8 (parallel) disks was used; the system achieves a peak I/O bandwidth of 600 MB/s. Each record contained 8 bytes: a 4-byte key and a 4-byte payload. As one can clearly see, the compute bound can be raised by distributing the load across many processors using MCSTL. (The results were measured with MCSTL version 0.4.0-alpha.)

Still, there is more potential when the algorithm used is explicitly parallelized. This has already been done: the multiway merging step is parallelized. However, some further tuning is necessary to achieve speedup under all circumstances, so the corresponding code has been disabled for now. To enable it, define the symbol STXXL_PARALLEL_MULTIWAY_MERGE.
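One way to define the symbol is on the compiler command line. The sketch below mirrors the Getting Started compile line; the source file name is a placeholder, and the STXXL's own include and link options (not shown) must be added as well:

```shell
# Enable the parallel multiway merging step via a preprocessor define;
# "mysort.cpp" is a placeholder for your own program.
g++-4.2.1 -O3 -fopenmp -DSTXXL_PARALLEL_MULTIWAY_MERGE \
    -I$MCSTL_ROOT/c++ -o mysort mysort.cpp
```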

MCSTL STXXL sort benchmark
