The streaming layer provides a framework for pipelined processing of large sequences. Many external memory algorithms implemented with the STXXL streaming layer save a factor of at least two in I/Os. The pipelined processing technique is well known in the database world. To the best of our knowledge, we are the first to apply this method systematically in the domain of external memory algorithms. We introduce it in the context of an external memory software library.
Usually, the interface of an external memory algorithm assumes that it reads its input from external memory container(s) and writes its output to external memory container(s). The idea of pipelining is to equip external memory algorithms with a new interface that allows them to feed their output as a data stream directly to the algorithm that consumes it, rather than writing it to external memory first. Logically, the input of an external memory algorithm does not have to reside in external memory; it can instead be a data stream produced by another external memory algorithm.
Many external memory algorithms can be viewed as a data flow through a directed acyclic graph. The file nodes represent physical data sources and data sinks, which are stored on disks (e.g. in the external memory containers of the STL-user layer). A file node reads and/or outputs one stream of elements. The streaming nodes read zero, one, or several streams and output zero, one, or several new streams. Streaming nodes are equivalent to scan operations in non-pipelined external memory algorithms. The difference is that conventional non-pipelined scanning needs a linear number of I/Os, whereas streaming nodes usually do not perform any I/O, unless a node needs to access external memory data structures (stacks, priority queues, etc.). The sorting nodes read a stream and output it in sorted order. Edges in the graph denote the directions of data flow between nodes. The question ``When is a pipelined execution of the computations in a data flow graph possible in an I/O-efficient way?'' has been analyzed in the literature.
In STXXL, all data flow node implementations have an STXXL stream interface, which is similar to the STL Input iterator concept. Like an input iterator, an STXXL stream object may be dereferenced to refer to some object and may be incremented to proceed to the next object in the stream. The reference obtained by dereferencing is read-only and must be convertible to the value_type of the STXXL stream. The STXXL stream concept also defines a boolean member function empty() which returns true iff the end of the stream is reached.
Now we tabulate the valid expressions and the expression semantics of the STXXL stream concept, in the style of the STL documentation.
Notation:

| Symbol | Description |
|---|---|
| X, X1, ..., Xn | Types that are models of STXXL stream |
| T | The value type of X |
| s, s1, ..., sn | Objects of types X, X1, ..., Xn |
| t | Object of type T |

Valid expressions:

| Name | Expression | Type requirements | Return type |
|---|---|---|---|
| Constructor | X s(s1,...,sn) | s1, ..., sn are convertible to X1&, ..., Xn& | |
| Dereference | *s | | Convertible to T |
| Member access | s->m | T is a type for which t.m is defined | |
| End of stream check | s.empty() | | bool |

Expression semantics:

| Name | Expression | Precondition | Semantics | Postcondition |
|---|---|---|---|---|
| Constructor | X s(s1,...,sn) | | s1, ..., sn are the input streams of s | |
| Dereference | *s | s is incrementable | | |
| Member access | s->m | s is incrementable | Equivalent to (*s).m | |
| Preincrement | ++s | s is incrementable | | s is incrementable or past-the-end |
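To make the interface concrete, the following is a minimal hand-written type that models the STXXL stream concept. The class counter_stream and its bounds are purely illustrative and not part of the library:

```cpp
#include <cassert>

// counter_stream: a hypothetical model of the STXXL stream concept
// that produces the integers [first, last).
class counter_stream
{
    int cur_;
    int last_;

public:
    typedef int value_type;  // the value type T of the stream

    counter_stream(int first, int last) : cur_(first), last_(last) {}

    // Dereference: read-only access to the current element.
    const value_type & operator * () const { return cur_; }

    // Member access (trivial here, since value_type has no members).
    const value_type * operator -> () const { return &cur_; }

    // Preincrement: proceed to the next element of the stream.
    counter_stream & operator ++ () { ++cur_; return *this; }

    // End-of-stream check.
    bool empty() const { return cur_ >= last_; }
};
```

A consumer drains such a stream with the idiomatic loop `while (!s.empty()) { use(*s); ++s; }`.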
The binding of an STXXL stream object to its input streams (incoming edges in a data flow graph) happens at compile time, i.e. statically. The alternative approach would be to allow binding at run time using the C++ virtual function mechanism. However, this would result in a severe performance penalty because most C++ compilers are not able to inline virtual functions. To avoid this disadvantage, we follow the static binding approach using C++ templates. For example, assuming that streams s1, ..., sn are already constructed, constructing stream s with the constructor X::X(X1& s1,..., Xn& sn) binds s to its inputs s1, ..., sn.
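This static binding pattern can be sketched as follows. The node scale_stream and the source iota_stream are hypothetical examples, not library classes; a node takes the type of its input stream as a template parameter and stores a reference to the input passed to its constructor, so every call through the stream interface can be inlined:

```cpp
#include <cassert>

// iota_stream: a hypothetical source node producing the integers [0, n).
class iota_stream
{
    int cur_;
    int n_;
public:
    typedef int value_type;
    explicit iota_stream(int n) : cur_(0), n_(n) {}
    const value_type & operator * () const { return cur_; }
    iota_stream & operator ++ () { ++cur_; return *this; }
    bool empty() const { return cur_ >= n_; }
};

// scale_stream: a hypothetical streaming node that multiplies every
// element of its input stream by a constant. The input stream type is
// a template parameter, so the binding happens at compile time and no
// virtual dispatch is involved.
template <class InputStream>
class scale_stream
{
    InputStream & in_;  // bound input (incoming edge of the data flow graph)
    int factor_;
    int current_;

    void fetch() { if (!in_.empty()) current_ = *in_ * factor_; }

public:
    typedef int value_type;

    // X s(s1)-style constructor: binds the node to its input stream.
    scale_stream(InputStream & in, int factor)
        : in_(in), factor_(factor) { fetch(); }

    const value_type & operator * () const { return current_; }
    scale_stream & operator ++ () { ++in_; fetch(); return *this; }
    bool empty() const { return in_.empty(); }
};
```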
After creating all node objects, the computation starts in a ``lazy'' fashion, first trying to evaluate the result of the topologically latest node. That node reads its input nodes element by element, using the dereference and increment operators of the STXXL stream interface. The input nodes proceed in the same way, invoking the inputs needed to produce an output element. This process terminates when the result of the topologically latest node is computed. This style of pipelined execution scheduling is I/O-efficient: it keeps the intermediate results in main memory without ever storing them in external memory.
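The pull-driven evaluation described above can be sketched with two hypothetical helpers, range_stream and materialize (not library classes): draining the topologically latest node pulls elements through the entire pipeline on demand, and nothing upstream runs until it is pulled.

```cpp
#include <cassert>
#include <iterator>
#include <vector>

// range_stream: a hypothetical source node producing the integers [0, n).
class range_stream
{
    int cur_;
    int n_;
public:
    typedef int value_type;
    explicit range_stream(int n) : cur_(0), n_(n) {}
    const value_type & operator * () const { return cur_; }
    range_stream & operator ++ () { ++cur_; return *this; }
    bool empty() const { return cur_ >= n_; }
};

// materialize: drains the topologically latest node of a pipeline.
// Each dereference/increment pulls exactly one element through the
// whole data flow graph, so intermediate results never need to be
// stored; here they go straight into an output iterator.
template <class Stream, class OutputIterator>
OutputIterator materialize(Stream & s, OutputIterator out)
{
    while (!s.empty())
    {
        *out = *s;  // pull the next result element
        ++out;
        ++s;        // ask the upstream nodes for the next element
    }
    return out;
}
```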
The streaming layer of the STXXL library offers generic classes that implement the functionality of sorting, file, and streaming nodes:
As mentioned above, STXXL allows streaming nodes to have more than one output. In this case, only one output of a streaming node can have the STXXL stream interface; the other outputs must be passed to file nodes (e.g. by calling the push_back method of stxxl::vector) or to sorting nodes (which also have a push_back interface).
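A hypothetical sketch of such a two-output node follows. The names iota_stream and split_parity_stream are illustrative only, and std::vector stands in for stxxl::vector or a sorting node's push_back interface: the even elements leave through the stream interface, while the odd elements are diverted to the push_back sink.

```cpp
#include <cassert>
#include <vector>

// iota_stream: a hypothetical source node producing the integers [0, n).
class iota_stream
{
    int cur_;
    int n_;
public:
    typedef int value_type;
    explicit iota_stream(int n) : cur_(0), n_(n) {}
    const value_type & operator * () const { return cur_; }
    iota_stream & operator ++ () { ++cur_; return *this; }
    bool empty() const { return cur_ >= n_; }
};

// split_parity_stream: the main output (even elements) has the STXXL
// stream interface; the second output (odd elements) is fed to any
// sink with a push_back method, e.g. an external memory vector.
template <class InputStream, class Sink>
class split_parity_stream
{
    InputStream & in_;
    Sink & side_;  // second output, consumed via push_back

    void skip_odds()
    {
        while (!in_.empty() && *in_ % 2 != 0)
        {
            side_.push_back(*in_);  // divert odd elements to the side output
            ++in_;
        }
    }

public:
    typedef int value_type;

    split_parity_stream(InputStream & in, Sink & side)
        : in_(in), side_(side) { skip_odds(); }

    const value_type & operator * () const { return *in_; }
    split_parity_stream & operator ++ () { ++in_; skip_odds(); return *this; }
    bool empty() const { return in_.empty(); }
};
```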
Now we ``pipeline'' the random graph generation example shown in the previous chapter. The data flow graph of the algorithm is presented in Figure 2 in the appendix. Listing 5 shows the pipelined code of the algorithm; the definitions of edge, random_edge, and edge_cmp are in Listing 3. Since the sorter of the streaming layer accepts an STXXL stream as input, we do not need to write out the random edges. Rather, we generate them on the fly. The random_edge_stream object (a model of STXXL stream) constructed in Line 19 supplies the sorter with a stream of random edges. In Line 20, we define the type of the sorter node; it is parameterized by the type of the input stream and the type of the comparison function object. Line 21 creates a SortedStream object, attaching its input to the RandomStream. The internal memory consumption of the sorter stream object is bounded by 512 MB. The UniqueStream object filters the duplicates in its input edge stream (Line 23); the generic stream::unique class is provided by the STXXL library. Line 26 records the content of the UniqueStream in the external memory vector. As in Listing 4 (Line 27), we cut the vector at the NewEnd boundary. Let us count the number of I/Os the program performs: random edge generation by RandomStream costs no I/O; sorting in SortedStream needs to store the sorted runs and read them again to merge -- 2N/(DB) I/Os; UniqueStream deletes duplicates on the fly, so it does not need any I/O; and materializing the final output can cost up to N/(DB) I/Os. In total, the program incurs only 3N/(DB) I/Os, compared to 7N/(DB) for the nonpipelined code in Section 2.3.
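The on-the-fly duplicate filtering performed by stream::unique can be sketched as a stream adaptor. The classes vector_stream and unique_stream below are simplified, hypothetical stand-ins (the library class is generic over the value type), but they show the principle: a duplicate is dropped the moment it is pulled, so no intermediate storage and no I/O is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// vector_stream: a hypothetical stream view over an in-memory vector,
// standing in for the sorted edge stream of the example.
class vector_stream
{
    const std::vector<int> & v_;
    std::size_t i_;
public:
    typedef int value_type;
    explicit vector_stream(const std::vector<int> & v) : v_(v), i_(0) {}
    const value_type & operator * () const { return v_[i_]; }
    vector_stream & operator ++ () { ++i_; return *this; }
    bool empty() const { return i_ >= v_.size(); }
};

// unique_stream: drops consecutive duplicates from a sorted input
// stream, analogous to stream::unique but simplified to int values.
template <class InputStream>
class unique_stream
{
    InputStream & in_;
    int current_;
    bool done_;
public:
    typedef int value_type;

    explicit unique_stream(InputStream & in) : in_(in), done_(in.empty())
    {
        if (!done_) { current_ = *in_; ++in_; }
    }

    const value_type & operator * () const { return current_; }

    unique_stream & operator ++ ()
    {
        // Skip all elements equal to the one just delivered.
        while (!in_.empty() && *in_ == current_)
            ++in_;
        if (in_.empty())
            done_ = true;
        else { current_ = *in_; ++in_; }
        return *this;
    }

    bool empty() const { return done_; }
};
```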