Institute of Theoretical Informatics, Algorithmics II

Lossless Compression of Climate Data using Machine Learning

    The use of new technologies, such as GPU boosters, has led to a dramatic increase in the computing power of HPC centers. This development, coupled with new climate models that make better use of the computing power thanks to an improved internal structure, shifts the bottleneck away from solving the differential equations of the model calculations and toward the actual storage of the variables.

    In this thesis, the use of machine learning (ML) algorithms for developing novel compression algorithms for structured floating-point data, such as climate data, will be investigated and prototypically implemented. Because of the large amount of data available, the climate sciences offer an ideal basis for testing different machine learning methods. With about 800 TiB of data, the IMK is the largest institute at KIT using the resources of the SCC. This facilitates the application and testing of all three types of ML methods: supervised, unsupervised and reinforcement learning.

    The aim of this thesis is to develop a prediction-based compression algorithm. Here, the data points in the dataset are traversed individually and a prediction is made for the current value. The difference (also called the residual) between the prediction and the true value is then calculated. Finally, this difference is encoded and stored on disk. Given the prediction method, the traversal strategy and the residuals, the data can be reconstructed without any loss. The more accurate the prediction, the smaller the difference and thus the final file size. Machine learning methods can help in developing new traversal strategies and better prediction methods.
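
    The scheme described above can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and not the method to be developed in the thesis: the traversal is a simple linear pass, the predictor is the previous value, the residual is the difference of the 32-bit integer bit patterns of the float32 values (taken modulo 2^32 so that it can be inverted exactly), and zlib stands in for the residual encoder. The function names are hypothetical.

```python
import struct
import zlib


def float_to_bits(x):
    """Reinterpret a float32 value as an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(b):
    """Reinterpret an unsigned 32-bit integer as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def compress(values):
    """Traverse linearly, predict each value, and encode the residuals."""
    prediction = 0  # assumption: the first prediction is the all-zero bit pattern
    residuals = []
    for v in values:
        bits = float_to_bits(v)
        # Residual on the bit patterns, modulo 2**32, so it is exactly invertible.
        residuals.append((bits - prediction) & 0xFFFFFFFF)
        prediction = bits  # placeholder predictor: the previous value
    raw = struct.pack(f"<{len(residuals)}I", *residuals)
    return zlib.compress(raw)  # zlib is a stand-in for the residual encoder


def decompress(blob):
    """Replay the same traversal and predictor to rebuild the values."""
    raw = zlib.decompress(blob)
    residuals = struct.unpack(f"<{len(raw) // 4}I", raw)
    prediction = 0
    values = []
    for r in residuals:
        bits = (prediction + r) & 0xFFFFFFFF
        values.append(bits_to_float(bits))
        prediction = bits
    return values


if __name__ == "__main__":
    # Synthetic, slowly varying data rounded to float32 (not real climate data).
    data = [bits_to_float(float_to_bits(280.0 + 0.01 * i)) for i in range(1000)]
    blob = compress(data)
    assert decompress(blob) == data  # reconstruction is bit-exact
    print(len(data) * 4, "bytes raw,", len(blob), "bytes compressed")
```

    Because the decoder repeats the same traversal and the same predictor, only the residuals need to be stored. A better predictor, for example a learned one, would replace the single line that updates `prediction`, as long as the encoder and the decoder use it identically.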

     

    Work on the thesis can begin immediately.
     

    Tasks

    • Familiarization with the data formats netCDF and HDF5
    • Evaluation of ML methods for the prediction of data points (e.g. supervised, unsupervised, reinforcement learning)
    • Engineering of the coding pipeline with regard to performance and compression factor
       

    Requirements

    • Master's student in computer science, information management or business informatics
    • Programming experience in Python
       

    Desirable skills

    • Initial experience with ML methods
    • Experience with other programming languages such as C++ or Rust

     

    Contact

    Dr. Uğur Çayoğlu