Perfect Hash Function Generation on the GPU with RecSplit

  • Links:PDF
  • A Perfect Hash Function is a mapping from n keys each to a number in {1,...,m}, such that there are no collisions, i.e., each key maps to a different number. The perfect hash function is constructed once (the set of keys is static) and used for queries. Parameters to optimize are memory consumption and "fill degree", i.e., n/m. Current methods need about 1.5 to 2.5 bits per key in practice and there is a theoretical lower bound of 1.44 bits per key. Common values for the fill degree are n/m=0.98 or n/m=0.81. A perfect hash function that achieves a fill degree of n/m=1 (n=m) is called Minimal Perfect Hash Function.

    Applications of Perfect Hash Functions are manifold. For example, a space-efficient static key-value store can be implemented by interpreting the hash value as a memory address. By storing a signature of objects, an AMQ filter can be developed that can answer membership queries ("is x in set S") with an adjustable error rate (see also: Bloom Filter). Often, one would like to have small, unique representatives of complex objects in order to handle them more efficiently - Perfect Hash Functions can provide that mapping. In the Google PageRank algorithm, a Perfect Hash Function can handle the mapping from URLs to node indices.

    Constructing a Perfect Hash Function with n=108 elements takes about 40 seconds using current sequential algorithms. The methods are often based on first reducing the input size and then trying many normal Hash Functions on small sets until they are collision free. Since trying out hash functions is a significant part of the runtime and can be well parallelized, the construction of Perfect Hash Functions is an ideal application for GPUs.

    Task

    Design, develop, and benchmark a parallel construction algorithm for Perfect Hash Functions on the GPU in CUDA.

    Requirements

    - Interest in space efficient data structures
    - Knowledge of C++ and CUDA

    References

    - "Hash, displace, and compress" (Belazzougui, Botelho, Dietzfelbinger)
    - RecSplit: Minimal Perfect Hashing via Recursive Splitting" (Esposito, Mueller Graf, Vigna)