We demonstrate some performance characteristics of STXXL using the external memory maximal independent set (MIS) algorithm from [39] as an example. This algorithm is based on the time-forward processing technique. As the input for the MIS algorithm, we use the random graph computed by the examples in the previous Sections (Listings 4 and 5). Our benchmark also includes the running time of the input generation.
Now we describe the MIS algorithm implementation in Listing 6, which is only nine lines long not including declarations. The algorithm visits the graph nodes scanning lexicographically sorted input edges. When a node is visited, we add it to the maximal independent set if none of its visited neighbours is already in the MIS. The neighbour nodes of the MIS nodes are stored as events in a priority queue. In Lines 6-7, the template metaprogram [12] PRIORITY_QUEUE_GENERATOR computes the type of priority queue that will store events. The metaprogram finds the optimal values for numerous tuning parameters (the number and the maximum arity of external/internal mergers, the size of merge buffers, external memory block size, etc.) under the constraint that the total size of the priority queue internal buffers must be limited by PQ_MEM bytes. The node_greater comparison functor defines the order of nodes of type node_type and minimum value that a node object can have, such that the top() method will return the smallest contained element. The last template parameter tells that the priority queue can not contain more than INPUT_SIZE elements (in 1024 units). Line 8 creates the priority queue depend having prefetch buffer pool of size PQ_PPOOL_MEM bytes and buffered write memory pool of size PQ_WPOOL_MEM bytes. The external vector MIS stores the nodes belonging to the maximal independent set. Ordered input edges come in the form of an STXXL stream called edges. If the current node edges->src is not a neighbour of a MIS node (the comparison with the current event depend.top(), Line 13), then it is included in MIS (if it was not there before, Line 15). All neighbour nodes edges->dst of a node in MIS edges->src are inserted in the event priority queue depend (Line 16). Lines 11-12 remove the events already passed through from the priority queue.
To make a comparison with other external memory libraries, we have implemented the graph generation algorithm using TPIE and LEDA-SM libraries. The MIS algorithm was implemented in LEDA-SM using its array heap data structure as a priority queue. The I/O-efficient implementation of the MIS algorithm was not possible in TPIE, since it does not have an I/O efficient priority queue implementation. For TPIE, we report only the running time of the graph generation. The source code of all our implementations is available under http://i10www.ira.uka.de/dementiev/stxxl/paper/index.shtml.
To make the benchmark closer to real applications,
we have added two 32-bit integer fields in the edge
data structure, which can store some additional information
associated with the edge. The implementations of
priority queue of LEDA-SM always store a pair
key,info
.
The info field takes at least four bytes. Therefore, to make
a fair comparison with STXXL, we have
changed the event data type stored in the priority
queue (Listing 6),
such that it also has a 4-byte dummy info field.
The experiments were run on a 2-processor workstation, having 2 GHz Xeon processors (only one processor was used) and 1 GB of main memory (swapping was switched off). The OS was Debian with Linux kernel 2.4.20. The computer had four 80 GB IDE (IBM/Hitachi 120 GXP series) hard disks formatted with the XFS file system and dedicated solely for the experiments. We used LEDA-SM version 1.3 with LEDA version 4.2.16 and TPIE of January 21, 2005. For compilation of STXXL and TPIE sources, the g++ compiler version 3.3 was used. LEDA-SM and LEDA were compiled with g++ compiler version 2.95, because they could not be compiled by later g++ versions. The compiler optimization level was set to -O3. For sorting we used library sorters that use C++ comparison operators to compare elements. All programs have been tuned to achieve their maximum performance. We have tried all available file access methods and disk block sizes. In order to tune the TPIE benchmark implementation, we followed the performance tuning section of [22]. The input size (the length of the random edge sequence, see Listing 4) for all tests was 2000 MB7. The benchmark programs were limited to use only 512 MB of main memory. The remaining 512 MB are given to operating system kernel, daemons, shared libraries and file system buffer cache, from which TPIE and LEDA-SM might benefit. The STXXL implementations do not use the file system cache.
| LEDA-SM | STXXL-STL | STXXL-Pipel. | TPIE | ||
| Input | Filling |
51/41 |
89/24 | 40/52 | |
| graph | Sorting |
371/23 |
188/45 | 100/20 | 307/28 |
| generation | Dup. removal |
160/26 |
104/40 | 109/39 | |
| MIS computation | 513/6 | 153/21 | 128/26 | -N/A- | |
| Total | 1095/16 | 534/33 | 228/24 | 456*/32* | |
Table 1 compares the MIS benchmark
performance of the LEDA-SM implementation with array heap priority
queue, the STXXL implementation based on the STL-user
level, a pipelined STXXL implementation, and
a TPIE implementation with only input graph generation.
The running times, averaged over three runs, and average I/O bandwidths
are given for each stage of the benchmark.
The running time of the different stages of the pipelined implementation
cannot be measured separately.
However, we show the values of time and I/O counters from the beginning of the
execution till the time when the sorted runs are written to the disk(s)
in the run formation phase of sorting, and from this point to the
end of the MIS computation. The total time numbers show that the pipelined
STXXL implementation is significantly faster than the other implementations.
It is
times faster than the second leading implementation (STXXL-STL).
The win is due to reduced I/O volume: the STXXL-STL implementation
transfers 17 GB, the pipelined implementation
needs only 5.2 GB. However the
fold I/O volume reduction
does not imply equal reduction of the running time because the run formation
fused with filling/generating
phase becomes compute bound.
This is indicated by the almost zero value of the STXXL I/O wait
counter, which measures the time the processing
thread waited for the completion of an I/O. The second reason is that
the fusion of merging, duplicate removal and CPU intensive priority queue
operations in the MIS computation is almost compute bound.
Comparing the running times of the total input graph generation we
conclude that STXXL-STL implementation is about 20 % faster than TPIE
and 53 % faster than LEDA-SM. This could be due to better (explicit)
overlapping between I/O and computation.
Another possible reason could be that TPIE uses a more expensive way
of reporting run-time errors, such as I/O errors8.
The running time of the filling
stage of STXXL-STL implementation is much higher than of TPIE and
LEDA-SM. This is due to the fact that those libraries rely on
operating system cache. The filled blocks do not go immediately
to the disk(s) but remain in the main memory until other data
needs to be cached by the system. The indication of this is
the very high bandwidth of 52 MB/s for TPIE implementation,
which is even higher than the maximum physical disk bandwidth (48 MB/s)
at its outermost zone.
However, the cached blocks need to be flushed in the sorting stage and then
the TPIE implementation pays the remaining due.
The unsatisfactory bandwidth
of 24 MB/s of the STXXL-STL filling phase could be improved by replacing
the call std::generate by the
native stxxl::generate call that efficiently overlaps
I/O and computation.
With a single disk it fills the
vector in 60 seconds with a bandwidth of 33 MB/s.
STXXL STL-user sorter sustains an I/O
bandwidth of about 45 MB/s, which is 95 % of the disk's peak
bandwidth.
The high CPU load in the priority queue
and not very perfect overlapping between I/O and computation
explain the low bandwidth of the MIS computation stage in
all three implementations.
We also run the graph generation test
on 16 GByte inputs. All implementations scale almost linearly with the input
size: the TPIE implementation finishes in 1h 3min, STXXL-STL in
49min, and STXXL-Pipelined in 28min.
The MIS computation of STXXL, which is dominated by PQ operations, is 3.35 times faster than LEDA-SM. The main reason for this big speedup is likely to be the more efficient priority queue algorithm from [31].
| STXXL-STL | STXXL-Pipelined | ||||
| Disks |
2 |
4 |
2 | 4 | |
| Input | Filling |
72/28 |
64/31 |
||
| graph | Sorting |
104/77 |
80/100 |
98/20 | 98/20 |
| generation | Dup. removal |
58/69 |
34/118 |
||
| MIS computation |
127/25 |
114/28 |
112/30 | 110/31 | |
| Total |
360/50 |
291/61 |
210/26 | 208/27 | |
Table 2 shows the parallel disk performance of the STXXL implementations. The STXXL-STL implementation achieves speedup of about 1.5 using two disks and 1.8 using four disks. The reason for this low speedup is that many parts of the code become compute bound: priority queue operations in the MIS computation stage, run formation in the sorting stage, and generating random edges in the filling stage. The STXXL-Pipelined implementation was almost compute bound in the single disk case, and as expected, with two disks the first phase shows no speedup. However the second phase has a small improvement in speed due to faster I/O. Close to zero I/O wait time indicates that the STXXL-Pipelined implementation is fully compute bound when running with two or four disks. We had run the STXXL-Pipelined implementation on very large graphs that require the entire space of four hard disks (360 GBytes). The results of this experiment, using a faster Opteron system, are shown in Table 3.
| Input volume | #nodes | #edges | #edges/#nodes | Running time | ||
| 100 GB | 200 |
|
4 | 2h 34min | ||
| 100 GB | 200 |
|
4 | 2h 44min |