

Performance

We demonstrate some performance characteristics of STXXL using the external memory maximal independent set (MIS) algorithm from [39] as an example. This algorithm is based on the time-forward processing technique. As the input for the MIS algorithm, we use the random graph computed by the examples in the previous sections (Listings 4 and 5). Our benchmark also includes the running time of the input generation.

We now describe the MIS algorithm implementation shown in Listing 6, which is only nine lines long, not counting declarations. The algorithm visits the graph nodes while scanning the lexicographically sorted input edges. When a node is visited, it is added to the maximal independent set if none of its already visited neighbours is in the MIS. The neighbour nodes of MIS nodes are stored as events in a priority queue. In Lines 6-7, the template metaprogram [12] PRIORITY_QUEUE_GENERATOR computes the type of the priority queue that will store the events. The metaprogram finds optimal values for numerous tuning parameters (the number and the maximum arity of external/internal mergers, the size of the merge buffers, the external memory block size, etc.) under the constraint that the total size of the priority queue's internal buffers is limited to PQ_MEM bytes. The node_greater comparison functor defines the order of nodes of type node_type and the minimum value that a node object can take, such that the top() method returns the smallest contained element. The last template parameter specifies that the priority queue cannot contain more than INPUT_SIZE elements (in units of 1024). Line 8 creates the priority queue depend with a prefetch buffer pool of PQ_PPOOL_MEM bytes and a buffered write memory pool of PQ_WPOOL_MEM bytes. The external vector MIS stores the nodes belonging to the maximal independent set. The ordered input edges arrive as an STXXL stream called edges. If the current node edges->src is not a neighbour of an MIS node (the comparison with the current event depend.top(), Line 13), it is added to MIS (unless it is already there, Line 15). All neighbour nodes edges->dst of an MIS node edges->src are inserted into the event priority queue depend (Line 16). Lines 11-12 remove already processed events from the priority queue.


Listing 6: Computing a Maximal Independent Set using STXXL.
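
Since the listing itself did not survive the conversion, the following is a minimal sketch of the loop described above, reconstructed from the description and the surviving fragments; the exact line numbering, the definition of node_type, and the priority queue constructor arguments are assumptions rather than the original code.

\begin{lstlisting}
// Sketch of the MIS computation (reconstructed, not the original Listing 6).
// node_type is assumed to be an integral node identifier; PQ_MEM, INPUT_SIZE,
// PQ_PPOOL_MEM and PQ_WPOOL_MEM are the constants named in the text.
struct node_greater : public std::greater<node_type>
{
    node_type min_value() const
    {
        // value that compares "smallest" under std::greater, so that
        // top() returns the smallest contained node
        return std::numeric_limits<node_type>::max();
    }
};
// the metaprogram picks the tuning parameters under the PQ_MEM budget
typedef stxxl::PRIORITY_QUEUE_GENERATOR<node_type, node_greater,
        PQ_MEM, INPUT_SIZE/1024>::result pq_type;

pq_type depend(PQ_PPOOL_MEM, PQ_WPOOL_MEM); // prefetch/write buffer pools
stxxl::vector<node_type> MIS;               // output: nodes in the MIS

for ( ; !edges.empty(); ++edges)
{
    // discard events of nodes that have already been passed
    while (!depend.empty() && depend.top() < edges->src)
        depend.pop();
    // src is not a neighbour of an MIS node
    if (depend.empty() || depend.top() != edges->src)
    {
        if (MIS.empty() || MIS.back() != edges->src) // not added yet
            MIS.push_back(edges->src);
        depend.push(edges->dst); // neighbours of MIS nodes become events
    }
}
\end{lstlisting}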

To compare with other external memory libraries, we have implemented the graph generation algorithm using the TPIE and LEDA-SM libraries. The MIS algorithm was implemented in LEDA-SM using its array heap data structure as a priority queue. An I/O-efficient implementation of the MIS algorithm was not possible in TPIE, since it does not have an I/O-efficient priority queue implementation; for TPIE, we therefore report only the running time of the graph generation. The source code of all our implementations is available at http://i10www.ira.uka.de/dementiev/stxxl/paper/index.shtml.

To make the benchmark closer to real applications, we have added two 32-bit integer fields to the edge data structure, which can store additional information associated with the edge. The priority queue implementations of LEDA-SM always store a pair <key,info>, and the info field occupies at least four bytes. Therefore, to make the comparison with STXXL fair, we have changed the event data type stored in the priority queue (Listing 6) so that it also carries a 4-byte dummy info field.
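
For concreteness, the padded types could look like the following sketch; the field names are made up for illustration, only the field sizes follow the description above.

\begin{lstlisting}
// Hypothetical layout of the padded types (field names are illustrative).
struct edge_type
{
    node_type src, dst;
    unsigned int data1, data2;  // the two added 32-bit payload fields
};
struct event_type               // element type stored in the priority queue
{
    node_type node;             // the key
    unsigned int dummy_info;    // 4-byte dummy info, mirroring LEDA-SM's <key,info>
};
\end{lstlisting}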

The experiments were run on a 2-processor workstation with 2 GHz Xeon processors (only one processor was used) and 1 GB of main memory (swapping was switched off). The OS was Debian Linux with kernel 2.4.20. The computer had four 80 GB IDE hard disks (IBM/Hitachi 120 GXP series), formatted with the XFS file system and dedicated solely to the experiments. We used LEDA-SM version 1.3 with LEDA version 4.2.16 and the TPIE release of January 21, 2005. The STXXL and TPIE sources were compiled with g++ version 3.3; LEDA-SM and LEDA were compiled with g++ version 2.95, because they could not be compiled by later g++ versions. The compiler optimization level was set to -O3. For sorting we used the library sorters that compare elements with C++ comparison operators. All programs were tuned to achieve their maximum performance: we tried all available file access methods and disk block sizes, and for the TPIE benchmark implementation we followed the performance tuning section of [22]. The input size (the length of the random edge sequence, see Listing 4) was 2000 MB for all tests. The benchmark programs were limited to 512 MB of main memory; the remaining 512 MB were left to the operating system kernel, daemons, shared libraries, and the file system buffer cache, from which TPIE and LEDA-SM might benefit. The STXXL implementations do not use the file system cache.


Table 1: Running time (in seconds) / I/O bandwidth (in MB/s) of the MIS benchmark running on a single disk. For TPIE, only the graph generation is shown (marked with *).
                              LEDA-SM   STXXL-STL   STXXL-Pipel.    TPIE
Input graph   Filling           51/41       89/24                   40/52
generation    Sorting          371/23      188/45       100/20     307/28
              Dup. removal     160/26      104/40                  109/39
MIS computation                 513/6      153/21       128/26      -N/A-
Total                         1095/16      534/33       228/24   456*/32*

Table 1 compares the MIS benchmark performance of the LEDA-SM implementation with the array heap priority queue, the STXXL implementation based on the STL-user level, a pipelined STXXL implementation, and a TPIE implementation of the input graph generation only. The running times, averaged over three runs, and the average I/O bandwidths are given for each stage of the benchmark. The running times of the different stages of the pipelined implementation cannot be measured separately; instead, we report the time and I/O counters from the beginning of the execution until the sorted runs are written to the disk(s) in the run formation phase of sorting, and from this point until the end of the MIS computation.

The total times show that the pipelined STXXL implementation is significantly faster than the other implementations: it is 2.4 times faster than the second best implementation (STXXL-STL). The win is due to the reduced I/O volume: the STXXL-STL implementation transfers 17 GB, whereas the pipelined implementation needs only 5.2 GB. However, the 3.25-fold reduction in I/O volume does not translate into an equal reduction of the running time, because the run formation, fused with the filling/generating phase, becomes compute-bound. This is indicated by the almost zero value of the STXXL I/O wait counter, which measures the time the processing thread has waited for the completion of an I/O. A second reason is that the fusion of merging, duplicate removal, and the CPU-intensive priority queue operations in the MIS computation is also almost compute-bound.

Comparing the running times of the total input graph generation, we conclude that the STXXL-STL implementation is about 20 % faster than TPIE and 53 % faster than LEDA-SM. This could be due to better (explicit) overlapping of I/O and computation. Another possible reason is that TPIE uses a more expensive way of reporting run-time errors, such as I/O errors. The running time of the filling stage of the STXXL-STL implementation is much higher than that of TPIE and LEDA-SM, because those libraries rely on the operating system cache: the filled blocks do not go to the disk(s) immediately but remain in main memory until other data needs to be cached by the system. An indication of this is the very high bandwidth of 52 MB/s for the TPIE implementation, which exceeds the maximum physical disk bandwidth (48 MB/s) at its outermost zone. However, the cached blocks must be flushed in the sorting stage, where the TPIE implementation pays the remaining cost. The unsatisfactory bandwidth of 24 MB/s of the STXXL-STL filling phase could be improved by replacing the std::generate call with the native stxxl::generate call, which efficiently overlaps I/O and computation; with a single disk it then fills the vector in 60 seconds with a bandwidth of 33 MB/s. The STXXL STL-user sorter sustains an I/O bandwidth of about 45 MB/s, which is 95 % of the disk's peak bandwidth. The high CPU load in the priority queue and the imperfect overlapping of I/O and computation explain the low bandwidth of the MIS computation stage in all three implementations.

We also ran the graph generation test on 16 GByte inputs. All implementations scale almost linearly with the input size: the TPIE implementation finishes in 1h 3min, STXXL-STL in 49min, and STXXL-Pipelined in 28min.
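
As an illustration of the suggested change, a filling call along the following lines would let STXXL overlap the random edge generation with writing the blocks to disk; the random_edge functor and the buffer count are assumptions, not the exact benchmark code.

\begin{lstlisting}
// Sketch: overlapping fill of the external edge vector (names are assumptions).
stxxl::vector<edge_type> ExtEdgeVec(INPUT_SIZE);    // INPUT_SIZE random edges
// blocking variant: std::generate(ExtEdgeVec.begin(), ExtEdgeVec.end(), random_edge());
stxxl::generate(ExtEdgeVec.begin(), ExtEdgeVec.end(),
                random_edge(),  // functor producing one random edge per call
                16);            // number of overlap buffers (tuning parameter)
\end{lstlisting}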

The MIS computation of STXXL, which is dominated by priority queue operations, is 3.35 times faster than that of LEDA-SM. The main reason for this big speedup is likely the more efficient priority queue algorithm from [31].


Table 2: Running time (in seconds) / I/O bandwidth (in MB/s) of the MIS benchmark running on multiple disks.
                                 STXXL-STL           STXXL-Pipelined
Disks                           2         4           2         4
Input graph   Filling         72/28     64/31
generation    Sorting        104/77    80/100       98/20     98/20
              Dup. removal    58/69    34/118
MIS computation              127/25    114/28      112/30    110/31
Total                        360/50    291/61      210/26    208/27

Table 2 shows the parallel disk performance of the STXXL implementations. The STXXL-STL implementation achieves a speedup of about 1.5 using two disks and 1.8 using four disks. The reason for this low speedup is that many parts of the code become compute-bound: the priority queue operations in the MIS computation stage, the run formation in the sorting stage, and the generation of random edges in the filling stage. The STXXL-Pipelined implementation was almost compute-bound already in the single disk case and, as expected, its first phase shows no speedup with two disks; the second phase, however, improves slightly due to faster I/O. A close to zero I/O wait time indicates that the STXXL-Pipelined implementation is fully compute-bound when running with two or four disks. We also ran the STXXL-Pipelined implementation on very large graphs that require the entire space of the four hard disks (360 GBytes). The results of this experiment, obtained on a faster Opteron system, are shown in Table 3.
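
For reference, the set of disks used by STXXL is declared in its disk configuration file; a four-disk setup could be described roughly as in the following sketch, where the paths and capacities are placeholders and the capacity units depend on the STXXL release.

\begin{lstlisting}
# example .stxxl configuration for four disks (placeholder paths and sizes)
disk=/mnt/disk1/stxxl,80000,syscall
disk=/mnt/disk2/stxxl,80000,syscall
disk=/mnt/disk3/stxxl,80000,syscall
disk=/mnt/disk4/stxxl,80000,syscall
\end{lstlisting}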


Table 3: Running time of the STXXL-Pipelined implementation running on very large random graphs (Opteron system).
Input volume   $N/M$   #nodes             #edges              #edges/#nodes   $D$   Running time
100 GB         200     $2.1\cdot 10^9$    $13.4\cdot 10^9$    $6.25$          4     2h 34min
100 GB         200     $4.3\cdot 10^9$    $13.4\cdot 10^9$    $3.13$          4     2h 44min

