Performance

We demonstrate some performance characteristics of STXXL using the external memory maximal independent set (MIS) algorithm from [39] as an example. This algorithm is based on the time-forward processing technique. As input for the MIS algorithm, we use the random graph computed by the examples in the previous sections (Listings 4 and 5). Our benchmark also includes the running time of the input generation.

We now describe the MIS algorithm implementation shown in
Listing 6, which is only nine lines long,
not counting declarations.
The algorithm visits the graph nodes by scanning the
lexicographically sorted input edges. When a
node is visited, it is added to the maximal independent set
if none of its previously visited neighbours is already in
the MIS. The neighbours of MIS nodes are stored
as events in a priority queue.
In Lines
6-7,
the template metaprogram [12]
`PRIORITY_QUEUE_GENERATOR`
computes the type of priority queue that will store events.
The metaprogram finds the optimal values for numerous tuning
parameters (the number and the maximum arity of
external/internal mergers, the size of merge buffers,
external memory block size, etc.) under the constraint that
the total size of the priority queue's internal buffers
is limited to `PQ_MEM` bytes.
The `node_greater` comparison functor defines
the order of nodes of type `node_type` and the minimum
value a node object can have, such that the `top()` method
returns the smallest contained element.
The last template parameter guarantees that the priority
queue never contains more than `INPUT_SIZE`
elements (given in units of 1024). Line 8 creates
the priority queue `depend` with a prefetch buffer pool
of `PQ_PPOOL_MEM` bytes and a buffered write
memory pool of `PQ_WPOOL_MEM` bytes. The external
vector `MIS` stores the nodes belonging to the
maximal independent set. Ordered input edges come in the form
of an STXXL stream called `edges`. If the current node
`edges->src` is not a neighbour of a MIS node
(the comparison with the current event `depend.top()`,
Line 13), then it is included in
`MIS` (if it was not there before, Line 15).
All neighbours `edges->dst` of a MIS node
`edges->src` are inserted into the event priority
queue `depend` (Line 16).
Lines 11-12
remove events that have already been processed from the priority queue.
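
Listing 6 itself is not reproduced in this section, so the following is a minimal sketch of the loop just described, assuming STXXL's priority queue and vector interfaces; `node_type` is taken to be an integral node identifier, `edges` is the sorted input stream from above, and the constants `PQ_MEM`, `INPUT_SIZE`, `PQ_PPOOL_MEM`, and `PQ_WPOOL_MEM` are assumed to be defined elsewhere as in the text.

```cpp
#include <limits>
#include <functional>
#include <stxxl/priority_queue>
#include <stxxl/vector>

typedef stxxl::uint64 node_type;  // assumption: integral node identifier

// Orders nodes and supplies the sentinel required by the STXXL priority
// queue; with this "greater" order, top() yields the smallest node.
struct node_greater : public std::greater<node_type>
{
    node_type min_value() const
    {
        return std::numeric_limits<node_type>::max();
    }
};

// The metaprogram picks merger arities, buffer sizes, block size, etc.;
// internal buffers use at most PQ_MEM bytes, and the queue holds at most
// INPUT_SIZE elements (the last parameter is in units of 1024 elements).
typedef stxxl::PRIORITY_QUEUE_GENERATOR<node_type, node_greater,
        PQ_MEM, INPUT_SIZE / 1024>::result pq_type;
typedef pq_type::block_type block_type;

template <class EdgeStream>
void compute_mis(EdgeStream& edges, stxxl::vector<node_type>& MIS)
{
    // Prefetch and buffered-write pools, sized in blocks (cf. Line 8).
    stxxl::prefetch_pool<block_type> p_pool(PQ_PPOOL_MEM / block_type::raw_size);
    stxxl::write_pool<block_type>    w_pool(PQ_WPOOL_MEM / block_type::raw_size);
    pq_type depend(p_pool, w_pool);

    for ( ; !edges.empty(); ++edges)   // scan the sorted edges
    {
        // Remove events that precede the current node (cf. Lines 11-12).
        while (!depend.empty() && depend.top() < edges->src)
            depend.pop();

        if (depend.empty() || edges->src != depend.top())
        {
            // edges->src is not a neighbour of a MIS node:
            // add it once (cf. Lines 13 and 15).
            if (MIS.empty() || MIS.back() != edges->src)
                MIS.push_back(edges->src);
            depend.push(edges->dst);   // exclude the neighbour (cf. Line 16)
        }
    }
}
```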

For comparison with other external memory
libraries, we implemented the graph generation algorithm using the
TPIE and LEDA-SM libraries. The MIS algorithm was implemented in
LEDA-SM using its array heap data structure as a priority queue.
An I/O-efficient implementation of the MIS algorithm was not possible
in TPIE, since it does not provide an I/O-efficient priority queue.
For TPIE, we therefore report only the running time of the
graph generation. The source code of all our implementations
is available at `http://i10www.ira.uka.de/dementiev/stxxl/paper/index.shtml`.

To make the benchmark closer to real applications,
we have added two 32-bit integer fields to the `edge`
data structure, which can store additional information
associated with the edge. The priority queue
implementations of LEDA-SM always store a (key, info) pair,
where the info field takes at least four bytes.
Therefore, to make the comparison with STXXL fair, we have
changed the event data type stored in the priority
queue (Listing 6)
so that it also contains a 4-byte dummy info field.
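
For concreteness, a hypothetical layout matching this description (the field names are ours, not taken from Listing 6):

```cpp
#include <stdint.h>

// Edge record: endpoints plus the two extra 32-bit payload fields added
// to bring the benchmark closer to real applications.
struct edge
{
    node_type src;
    node_type dst;
    uint32_t  data1;   // additional information associated with the edge
    uint32_t  data2;
};

// Event record for the priority queue: the node key plus a 4-byte dummy
// info field mirroring the (key, info) pairs stored by LEDA-SM.
struct event
{
    node_type node;
    uint32_t  dummy_info;
};
```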

The experiments were run on a two-processor workstation
with 2 GHz Xeon processors (only one processor was used) and 1 GB of main
memory (swapping was switched off). The OS was Debian Linux with
kernel 2.4.20. The computer had four 80 GB IDE hard disks
(IBM/Hitachi 120 GXP series), formatted with the XFS file system and dedicated
solely to the experiments. We used LEDA-SM version 1.3 with
LEDA version 4.2.1^{6} and TPIE
of January 21, 2005. For compilation
of STXXL and TPIE sources, the `g++`
compiler version 3.3 was used.
LEDA-SM and LEDA were compiled with
`g++` compiler version 2.95,
because they could not be compiled by later
`g++` versions. The compiler
optimization level
was set to `-O3`.
For sorting we used library sorters that
use C++ comparison operators to compare elements.
All programs were tuned to achieve
their maximum performance: we tried
all available file access methods and disk
block sizes. To tune the TPIE benchmark
implementation, we followed the performance
tuning section of [22].
The input size (the length of the random edge sequence,
see Listing 4)
for all tests was 2000 MB^{7}.
The benchmark programs were limited to 512 MB
of main memory; the remaining 512 MB
were left to the operating system kernel, daemons,
shared libraries, and the file system buffer cache,
from which TPIE and LEDA-SM might benefit.
The STXXL implementations do not use the file system cache.
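
For reference, STXXL locates its disks via a configuration file; a hypothetical `.stxxl` file for the four dedicated disks could look as follows (paths, capacities, and the `syscall` access method are our assumptions; in the STXXL releases we are aware of, the capacity is given in MiB):

```
# one line per disk: disk=<file path>,<capacity in MiB>,<access method>
disk=/mnt/disk1/stxxl,70000,syscall
disk=/mnt/disk2/stxxl,70000,syscall
disk=/mnt/disk3/stxxl,70000,syscall
disk=/mnt/disk4/stxxl,70000,syscall
```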

Table 1. Running time (in seconds) / I/O bandwidth (in MB/s) for each stage of the MIS benchmark.

| | LEDA-SM | STXXL-STL | STXXL-Pipel. | TPIE |
|---|---|---|---|---|
| Input graph generation: Filling | 51/41 | 89/24 | | 40/52 |
| Input graph generation: Sorting | 371/23 | 188/45 | 100/20¹ | 307/28 |
| Input graph generation: Dup. removal | 160/26 | 104/40 | | 109/39 |
| MIS computation | 513/6 | 153/21 | 128/26² | N/A |
| Total | 1095/16 | 534/33 | 228/24 | 456*/32* |

¹ Measured from the beginning of the execution until the sorted runs are written to the disk(s).
² Measured from that point until the end of the MIS computation.
\* Graph generation only; an I/O-efficient MIS implementation is not possible in TPIE.

Table 1 compares the MIS benchmark
performance of the LEDA-SM implementation with the array heap priority
queue, the STXXL implementation based on the STL-user
layer, a pipelined STXXL implementation, and
a TPIE implementation covering only the input graph generation.
The running times, averaged over three runs, and the average I/O bandwidths
are given for each stage of the benchmark.
The running times of the different stages of the pipelined implementation
cannot be measured separately; instead, we show the values of the time and
I/O counters from the beginning of the
execution until the sorted runs are written to the disk(s)
in the run formation phase of sorting, and from this point to the
end of the MIS computation. The total time numbers show that the pipelined
STXXL implementation is significantly faster than the other implementations.
It is about 2.3 times faster than the second best implementation (STXXL-STL).
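
For reference, the pipelined variant composes the stages as STXXL streams (cf. Listing 5). A minimal sketch, assuming the `stxxl::stream` interface; `random_edge_stream`, `edge_cmp`, `BLOCK_SIZE`, and `RUN_MEM` are hypothetical names, and `edge_cmp` must supply the `min_value()`/`max_value()` sentinels required by `stream::sort`:

```cpp
#include <stxxl/stream>

// Generation -> sorting -> duplicate removal fused into one pipeline; the
// resulting stream feeds the MIS loop without intermediate results on disk.
typedef stxxl::stream::sort<random_edge_stream, edge_cmp, BLOCK_SIZE> sorted_t;
typedef stxxl::stream::unique<sorted_t> unique_t;

void run_pipeline()
{
    random_edge_stream in(INPUT_SIZE);         // produces edges on demand
    sorted_t sorted(in, edge_cmp(), RUN_MEM);  // RUN_MEM bytes for run formation
    unique_t edges(sorted);                    // duplicate-free sorted stream
    // ... the MIS loop consumes `edges` here ...
}
```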

The win is due to reduced I/O volume: the STXXL-STL implementation
transfers 17 GB, whereas the pipelined implementation
needs only 5.2 GB. However, this more than threefold reduction in I/O volume
does not translate into an equal reduction of the running time, because the
run formation phase, fused with the filling/generating phase,
becomes compute bound.
This is indicated by the almost zero value of the STXXL I/O wait
counter, which measures the time the processing
thread spends waiting for the completion of an I/O. The second reason is that
the fusion of merging, duplicate removal, and CPU-intensive priority queue
operations in the MIS computation is also almost compute bound.
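
The counter is part of STXXL's I/O statistics facility; a minimal sketch of such a measurement, assuming the `stxxl::stats` interface of later STXXL releases:

```cpp
#include <iostream>
#include <stxxl/io>

void measure_stage()
{
    stxxl::stats* s = stxxl::stats::get_instance();
    stxxl::stats_data before(*s);        // snapshot of all I/O counters

    // ... run one benchmark stage here ...

    stxxl::stats_data d = stxxl::stats_data(*s) - before;
    std::cout << "I/O wait time: " << d.get_io_wait_time() << " s, "
              << "read volume: " << d.get_read_volume() << " bytes"
              << std::endl;
}
```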

Comparing the total running times of the input graph generation, we
conclude that the STXXL-STL implementation is about 20 % faster than TPIE
and 53 % faster than LEDA-SM. This could be due to better (explicit)
overlapping of I/O and computation.
Another possible reason is that TPIE uses a more expensive way
of reporting run-time errors, such as I/O errors^{8}.

The running time of the filling
stage of the STXXL-STL implementation is much higher than that of TPIE and
LEDA-SM. This is because those libraries rely on the
operating system cache: the filled blocks do not go immediately
to the disk(s) but remain in main memory until other data
needs to be cached by the system. An indication of this is
the very high bandwidth of 52 MB/s for the TPIE implementation,
which even exceeds the maximum physical disk bandwidth (48 MB/s)
at its outermost zone.
However, the cached blocks must be flushed in the sorting stage, and then
the TPIE implementation pays the deferred cost.

The unsatisfactory bandwidth
of 24 MB/s in the STXXL-STL filling phase could be improved by replacing
the `std::generate` call with the
native `stxxl::generate` call, which efficiently overlaps
I/O and computation.
With a single disk, the latter fills the
vector in 60 seconds at a bandwidth of 33 MB/s.
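
A minimal sketch of this replacement, assuming the `stxxl::generate` scanning primitive; `random_edge` is a hypothetical functor returning a new random edge per call (as in Listing 4), and `NUM_EDGES` a hypothetical element count:

```cpp
#include <stxxl/vector>
#include <stxxl/scan>   // assumed location of stxxl::generate

void fill_edges()
{
    stxxl::vector<edge> ExtEdgeVec(NUM_EDGES);
    // The last argument is the number of overlap buffers: while some blocks
    // are being written, new elements are generated into others.
    stxxl::generate(ExtEdgeVec.begin(), ExtEdgeVec.end(), random_edge(), 32);
}
```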

The STXXL STL-user sorter sustains an I/O
bandwidth of about 45 MB/s, which is 95 % of the disk's peak
bandwidth.
The high CPU load of the priority queue operations
and the imperfect overlapping of I/O and computation
explain the low bandwidth of the MIS computation stage in
all three implementations.

We also ran the graph generation test
on 16 GByte inputs. All implementations scale almost linearly with the input
size: the TPIE implementation finishes in 1h 3min, STXXL-STL in
49min, and STXXL-Pipelined in 28min.

The MIS computation of STXXL, which is dominated by priority queue operations, is 3.35 times faster than that of LEDA-SM. The main reason for this large speedup is likely the more efficient priority queue algorithm from [31].

Table 2. Running time (in seconds) / I/O bandwidth (in MB/s) of the MIS benchmark using two and four disks.

| | STXXL-STL, 2 disks | STXXL-STL, 4 disks | STXXL-Pipelined, 2 disks | STXXL-Pipelined, 4 disks |
|---|---|---|---|---|
| Input graph generation: Filling | 72/28 | 64/31 | | |
| Input graph generation: Sorting | 104/77 | 80/100 | 98/20¹ | 98/20¹ |
| Input graph generation: Dup. removal | 58/69 | 34/118 | | |
| MIS computation | 127/25 | 114/28 | 112/30² | 110/31² |
| Total | 360/50 | 291/61 | 210/26 | 208/27 |

¹ ² The two measured phases of the pipelined implementation, as in Table 1.

Table 2 shows the parallel disk performance of the STXXL implementations. The STXXL-STL implementation achieves a speedup of about 1.5 using two disks and 1.8 using four disks. The reason for this low speedup is that many parts of the code become compute bound: the priority queue operations in the MIS computation stage, the run formation in the sorting stage, and the generation of random edges in the filling stage. The STXXL-Pipelined implementation was almost compute bound already in the single-disk case and, as expected, its first phase shows no speedup with two disks. However, the second phase improves slightly in speed due to faster I/O. The close-to-zero I/O wait time indicates that the STXXL-Pipelined implementation is fully compute bound when running with two or four disks.

We have also run the STXXL-Pipelined implementation on very large graphs that require the entire space of the four hard disks (360 GBytes). The results of this experiment, using a faster Opteron system, are shown in Table 3.

Table 3. Running the STXXL-Pipelined implementation on very large graphs on a faster Opteron system.

| Input volume | #nodes | #edges | #edges/#nodes | Running time |
|---|---|---|---|---|
| 100 GB | 200 | | 4 | 2h 34min |
| 100 GB | 200 | | 4 | 2h 44min |