In recent years there has been an exponential increase in publicly available DNA data. This data encodes a huge amount of information on the ancestry of organisms, their evolution, and their abilities and properties. Until recently, it was not possible to search this data in its entirety. Providing online search facilities would enable research into epidemiology, basic science of disease, and antimicrobial resistance. Such search would be as enabling for microbial genomics as natural text search engines are for many everyday tasks.
Recently the Bitsliced Genomic Signature Index (BIGSI) was introduced allowing the search on all viral and bacterial genomic sequence reads in the European Nucleotide Archive (ENA). As of December 2016, the data set consists of 447,833 genomes using 170 TiB of space. However, longer queries can take up to half an hour on a high performance machine.
In this thesis we introduce the Compact Bit-Sliced Signature Index (COBS), a new index variant with significantly improved space and time requirements. In our empirical evaluation we show that execution time is improved by two orders of magnitude while space requirements are cut in half. For longer queries the execution time can be reduced from half an hour to under five seconds. Additionally, we enrich the query results with quality information which can be used for ranking the result set.
Several techniques are introduced which make these improvements possible. A compact data structure layout is used to achieve large decreases in index size with only a marginal increase in execution time. Additionally, algorithm engineering techniques are employed to show why the choice of index parameters made by BIGSI was not optimal and how optimal parameters can be chosen. This insight is used to significantly reduce execution time. Next, a new set of equations is introduced to enrich the result with quality information which gives the user more insight into the composition of the result and could be used to increase effectiveness in the future. Lastly, we add several low-level optimizations, exploiting modern hardware, which further lower execution time.
To evaluate the techniques, several experiments are conducted which give insight into index construction, execution performance, space-time trade-offs, false positive distributions and the impact of the engineering optimizations. Multiple areas where future work can be conducted are introduced. Finally, the steps required for the creation of a dynamic online index which supports sharding on multiple machines are described.
While COBS was constructed to address the challenges of searching large genomic collections, the main findings can be applied to a wide range of problems.
COBS is written in C++17 and is available at https://github.com/devgg/cobs.