Documentation
Doxygen-generated documentation
The library is under continuous development. Therefore, please report all bugs and performance insufficiencies immediately. Sorting currently works best, as does embarrassing parallelism like for_each. An algorithm not available in the (original) STL is multiway merging. However, it is very important for both parallel and/or external sorting. Hence, it is included in the MCSTL.
Platform Requirements
MCSTL should successfully compile with all compilers that support OpenMP version 2.5. The GCC implementation of the C++ standard library (libstdc++) must be used, however. We have tested up to GCC 4.2.2 and the Intel C++ Compiler 10.1.011 (Intel compiler particularities). MCSTL has only been tested on Linux and Solaris, so far. Tests with the Microsoft compiler are not yet successful.
Changes to Your Application
You do not have to change any source code in you application. However, to benefit from the speedup, you should use STL algorithms like
sort, random_shuffle, partition, merge, find, nth_element, partial_sum, and for_each
and friends
as often as possible.
The only thing you have to do to get parallelism incorporated, is to include two additional include directories, e. g. by prepending
-I$(MCSTL_ROOT)/c++
to the command line. The alternative headers will be superimposed transparently, so no changes are made to the original headers. OpenMP must be supported and switched on, e. g. by adding
-fopenmp
to the command line for GCC,
-openmp
for the Intel compiler.
Tuning and Customization
For using any of the tuning options, you must
#include <mcstl.h>
first. To specify the number of threads to be used, set the
mcstl::SETTINGS::num_threads
variable accordingly. To force a function to execute sequentially, addmcstl::sequential_tag()
to the end of the argument list. On the other hand side, you can force some functions to be executed in parallel, and also, whether they will be any load-balancing applied (recommended for dynamic workload).
Parallelism always incurs some overhead. Thus, it is not helpful to parallelize operations on very small sets of data. MCSTL provides measures to avoid parallelizing stuff that is not worth it. For each algorithm, a minimum problem size can be stated, usually using the variable
mcstl::SETTINGS::[algorithm]_minimal_n.
Please see mcstl.h for details.
Getting Started Example
Compile the file parsort.cpp using:g++-4.2.1 -O3 -fopenmp -I$MCSTL_ROOT/c++ -o parsort parsort.cpp
This assumes that your GCC 4.2.1 is installed in
/usr/local/bin
.The MCSTL is expected in
$MCSTL_ROOT.
For reliable performance measurement, it is mandatory to switch on optimization, of course. Since atomic operations are used, that would not be supported by older processors, you might have to specify
-march=i686
on 32-bit Intel architectures, to get the code compiled. This, of course, means that the program will only run on Pentium-II-class or better computers, which should not be a severe limitation, though.
Using MCSTL and STXXL in Combination
MCSTL performs particularly well in combination with the STXXL. Just compile the STXXL with the additional make targets of the latest STXXL release, and compile your program with the options described above. Below, experimental results for sorting are shown. A computer with 4 processors and 8 (parallel) disks was used. The system achieves a peak I/O bandwidth of 600 MB/s. Each record contained 8 bytes, namely 4 bytes key and 4 bytes load. As one can see clearly, the compute bound can be raised by dividing the load onto many processors using MCSTL. (The results where measured with MCSTL version 0.4.0-alpha.)
Still, there is more potential when one explicitly parallelizes the used algorithm. This has also been done already, the multiway merging step is parallelized. However, some future tuning is necessary to achieve speedup under all circumstances. Therefore, the corresponding code has been disabled. To enabled it, define the symbol STXXL_PARALLEL_MULTIWAY_MERGE.