Checkpoint/Restart of/from lossy state

Research topic and goals

State compression is an important technique for mitigating the bandwidth plateau of HDD-based storage systems.

The state saved in checkpoints by HPC executions consists mostly of floating-point data (single or double precision). Two types of compressors have been developed for floating-point data. Lossless compressors preserve all of the initial information and try to reduce the space it occupies, using techniques such as entropy coding, run-length encoding, and dictionary coding (Lindstrom and Isenburg 2006), (Ibtesham et al. 2012). The compression factor generally observed on scientific data sets with lossless compression is 1.2 to 2 (the compressed data set is 1.2 to 2 times smaller than the original).

Lossy compressors deliberately discard information in order to reduce the size of the initial data set further. They reach highly variable compression ratios depending on the application and on the error bounds set by the user: from less than 10 for difficult-to-compress data sets (on some extreme cases where lossless compressors would achieve a ratio of only 1.5 to 2) up to 100 for easier-to-compress data sets. Lossy compressors may or may not be error bounded. Compressors that are not error bounded simply compress the data set as much as they can, with no guarantee on the error of each decompressed value; their applicability in HPC is limited because users have accuracy expectations. Error-bounded lossy compressors provide different error controls, such as absolute and relative error bounds, and some also provide statistical bounds (SZ, for example, can guarantee a lower bound on the PSNR). The user thus has a guarantee that the loss of information is quantified by a maximum compression/decompression error, and can set the error bound to match their accuracy expectations. Note that all lossy compressors keep every data point of the initial data set: none of them drop data points.
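
As a concrete illustration of an absolute error bound, the sketch below implements the simplest possible error-bounded scheme, uniform scalar quantization. The helper names are hypothetical; real compressors such as SZ add prediction and entropy coding around such a quantization step, so this illustrates the guarantee, not any particular compressor.

    import numpy as np

    def compress(data, abs_err):
        # Quantize each value to the nearest multiple of 2*abs_err, which
        # guarantees |x - decompress(compress(x))| <= abs_err pointwise.
        # A real compressor would entropy-code the integer codes.
        return np.round(data / (2.0 * abs_err)).astype(np.int64)

    def decompress(codes, abs_err):
        return codes * (2.0 * abs_err)

    rng = np.random.default_rng(0)
    field = np.cumsum(rng.normal(size=1000))  # a smooth-ish 1D field
    eps = 1e-3                                # user-chosen absolute error bound
    restored = decompress(compress(field, eps), eps)
    assert np.max(np.abs(field - restored)) <= eps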

Few studies have considered checkpointing and restart from lossy compressed states. Those that exist are limited in scope, each studying a single application and a single type of lossy compressor: NICAM with a lossy compressor based on wavelets and quantization (Sasaki et al. 2015), and ChaNGa with fpzip used as a lossy compressor (Ni et al. 2014). They already reveal two interesting points. First, checkpoints are composed of different variables that present different sensitivities to lossy compression, and the correctness of the execution after restart depends on how each variable is compressed. In the cosmology simulation (ChaNGa), lossy compression of the particle positions causes the execution to hang at a high compression level, while compressing other variables at the same level does not. Second, in the NICAM study, the authors consider an error of 1 percent on the final result acceptable when restarting from lossy checkpoints; the rationale is that this magnitude of error is similar to that of sensor and model errors, while the compression factor exceeds 5.

In contrast to previous research, which concentrated on a few applications, we focus on simple problems used by many applications and try to understand how they behave. We are exploring diffusion and advection problems: the diffusion problem simulates heat diffusion on a one-dimensional rod, and the advection problem simulates a sine wave advecting to the right with periodic boundaries.
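
For reference, here is a minimal sketch of the two model problems, assuming a first-order upwind discretization for advection and an explicit (FTCS) discretization for diffusion; the discretizations used in our experiments may differ.

    import numpy as np

    def diffusion_step(u, alpha, dx, dt):
        # One explicit (FTCS) step of u_t = alpha * u_xx on a 1D rod with
        # fixed ends; stable for alpha * dt / dx**2 <= 0.5.
        un = u.copy()
        un[1:-1] += alpha * dt / dx**2 * (u[2:] - 2.0 * u[1:-1] + u[:-2])
        return un

    def advection_step(u, c, dx, dt):
        # One first-order upwind step of u_t + c * u_x = 0 (c > 0) with
        # periodic boundaries; stable for the CFL condition c * dt / dx <= 1.
        return u - c * dt / dx * (u - np.roll(u, 1))

    nx = 200
    dx = 1.0 / nx
    x = np.arange(nx) * dx
    u_adv = np.sin(2.0 * np.pi * x)            # sine wave advecting right
    u_dif = np.exp(-100.0 * (x - 0.5) ** 2)    # heat pulse on the rod
    for _ in range(100):
        u_adv = advection_step(u_adv, c=1.0, dx=dx, dt=0.4 * dx)
        u_dif = diffusion_step(u_dif, alpha=1.0, dx=dx, dt=0.4 * dx**2)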

Results for 2015/2016

Our observations of restart from lossy checkpoints in dynamic simulations show that the diffusion and advection problems react differently, and we are currently characterizing and formalizing this difference. We suspect that the tolerable lossiness increases as the execution progresses: restarting from a lossy checkpoint near the beginning of the execution seems to affect the end result much more than restarting from a lossy checkpoint near the end of the execution. Moreover, we note that within a given variable field, some regions are more chaotic than others. Some applications may not tolerate lossy compression on the entire field, but we suspect that they may tolerate it in the nonchaotic regions of the field.
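
The kind of experiment behind these observations can be sketched as follows, with illustrative parameters and plain quantization standing in for a real error-bounded compressor: take a lossy checkpoint at step k of an advection run, restart from it, and measure how far the final solution deviates from an uncompressed reference run as k varies.

    import numpy as np

    def step(u, cfl=0.4):
        # First-order upwind advection step with periodic boundaries.
        return u - cfl * (u - np.roll(u, 1))

    def lossy(u, eps):
        # Absolute-error-bounded quantization standing in for a compressor.
        return np.round(u / (2.0 * eps)) * (2.0 * eps)

    nx, steps, eps = 200, 400, 1e-2
    u0 = np.sin(2.0 * np.pi * np.arange(nx) / nx)

    ref = u0.copy()                    # reference run without any restart
    for _ in range(steps):
        ref = step(ref)

    # Take a lossy checkpoint at step k, restart from it, and compare the
    # final state with the reference; sweeping k probes how the checkpoint
    # time influences the end result.
    for k in (50, 200, 350):
        u = u0.copy()
        for _ in range(k):
            u = step(u)
        u = lossy(u, eps)              # checkpoint + restart from lossy state
        for _ in range(steps - k):
            u = step(u)
        print(k, np.max(np.abs(u - ref)))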

Results for 2016/2017

In particular, we explored an important problem: how to bound the impact of restarting from a lossy checkpoint and guarantee that this impact does not affect the quality of the application's results. To address this problem, we established a link between the compression error and the numerical error of the application. Applications using numerical methods already suffer truncation and discretization errors. We showed that compressing checkpoints with an error lower than these numerical errors preserves the quality of the application results, and we demonstrated empirically that the error introduced by restarting from a lossy checkpoint can be bounded.
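
The precise criterion is developed in (Calhoun et al. 2018). The sketch below only illustrates the idea on the upwind advection problem from above: estimate the discretization error against the known exact solution, then set the compressor's absolute error bound safely below it (the 0.1 safety factor is an arbitrary illustrative choice, not the paper's).

    import numpy as np

    def advect(nx, t_final, cfl=0.5):
        # Upwind advection of sin(2*pi*x) at unit speed on an nx-point
        # periodic grid, advanced up to t_final.
        x = np.arange(nx) / nx
        u = np.sin(2.0 * np.pi * x)
        dt = cfl / nx                  # dx = 1/nx, c = 1
        for _ in range(int(round(t_final / dt))):
            u = u - cfl * (u - np.roll(u, 1))
        return x, u

    x, u = advect(nx=200, t_final=1.0)
    exact = np.sin(2.0 * np.pi * (x - 1.0))  # exact solution after one period
    disc_err = np.max(np.abs(u - exact))     # discretization error of the scheme

    # Guidance: pick the checkpoint compression bound safely below the
    # numerical error, so the loss is hidden under errors already present.
    eps = 0.1 * disc_err
    print(f"discretization error {disc_err:.2e} -> compression bound {eps:.2e}")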

Other researchers will be able to exploit the link we established between the compression error and the numerical errors to design better compression algorithms, as well as numerical methods that better tolerate compression errors.

An important impact of this work on other disciplines that use numerical simulation is that they can now adopt lossy compression for checkpoint/restart, since we established and verified guidance for setting the compression error bound so that the quality of the numerical results is guaranteed.

Jon Calhoun wrote and defended a Ph.D. dissertation presenting this research and its results.

Results for 2017/2018

The paper on the results of this research submitted to IJHPCA has been accepted for publication.

Visits and meetings

Franck Cappello visits UIUC almost every week, and we meet for 30 minutes to an hour almost every visit. Jon Calhoun completed an 11-week internship at ANL.

Impact and publications

This research continues at Argonne National Laboratory, focusing on restart from lossy checkpointing for iterative numerical methods in linear algebra; a paper has been submitted to a top ACM conference.

The results of this project motivated the submission of the NSF Aletheia project, which has been awarded and is funded for 3 years.

Funded by the NSF Aletheia project, a Ph.D. student (Wang Chen) at UIUC is exploring how to detect corruption in lossy compressed results (e.g., checkpoints) of numerical simulations.

See (Calhoun et al. 2018) and (Calhoun 2017).

  1. Calhoun, Jon, Franck Cappello, Luke N. Olson, Marc Snir, and William D. Gropp. 2018. “Exploring the Feasibility of Lossy Compression for PDE Simulations.” Int. J. High Perform. Comput. Appl. 27. Sage Publications, Inc.
    @article{Calhoun18,
      author = {Calhoun, Jon and Cappello, Franck and Olson, Luke N. and Snir, Marc and Gropp, William D.},
      journal = {Int. J. High Perform. Comput. Appl.},
      publisher = {Sage Publications, Inc.},
      title = {Exploring the Feasibility of Lossy Compression for PDE Simulations},
      volume = {27},
      year = {2018}
    }
    
  2. Calhoun, Jon. 2017. “From Detection to Optimization: Impact of Soft Errors on High-Performance Computing Applications.” Ph.D. dissertation: https://www.ideals.illinois.edu/handle/2142/98379.
    @unpublished{Calhoun17,
      author = {Calhoun, Jon},
      note = {Ph.D. dissertation: https://www.ideals.illinois.edu/handle/2142/98379},
      title = {From detection to optimization: impact of soft errors on high-performance computing applications},
      year = {2017}
    }
    

Future plans

We still need to establish a formal link between the numerical errors and the compression error.

References

  1. Sasaki, N., K. Sato, T. Endo, and S. Matsuoka. 2015. “Exploration of Lossy Compression for Application-Level Checkpoint/Restart.” In Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International, 914–22.
    @inproceedings{SasakiETAl2015,
      author = {Sasaki, N. and Sato, K. and Endo, T. and Matsuoka, S.},
      booktitle = {Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International},
      title = {Exploration of Lossy Compression for Application-Level Checkpoint/Restart},
      year = {2015},
      pages = {914-922},
      month = may
    }
    
  2. Ni, Xiang, Tanzima Islam, Kathryn Mohror, Adam Moody, and Laxmikant V Kale. 2014. “Lossy Compression for Checkpointing: Fallible or Feasible?” In International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
    @inproceedings{NiETAl2014,
      title = {Lossy compression for checkpointing: Fallible or feasible?},
      author = {Ni, Xiang and Islam, Tanzima and Mohror, Kathryn and Moody, Adam and Kale, Laxmikant V},
      booktitle = {International Conference for High Performance Computing, Networking, Storage and Analysis (SC)},
      year = {2014}
    }
    
  3. Ibtesham, Dewan, Dorian Arnold, Kurt B. Ferreira, and Patrick G. Bridges. 2012. “On the Viability of Checkpoint Compression for Extreme Scale Fault Tolerance.” In Proceedings of the 2011 International Conference on Parallel Processing - Volume 2, 302–11. Euro-Par’11. Berlin, Heidelberg: Springer-Verlag. doi:10.1007/978-3-642-29740-3_34.
    @inproceedings{IbteshamETAl2012,
      author = {Ibtesham, Dewan and Arnold, Dorian and Ferreira, Kurt B. and Bridges, Patrick G.},
      title = {On the Viability of Checkpoint Compression for Extreme Scale Fault Tolerance},
      booktitle = {Proceedings of the 2011 International Conference on Parallel Processing - Volume 2},
      series = {Euro-Par'11},
      year = {2012},
      isbn = {978-3-642-29739-7},
      location = {Bordeaux, France},
      pages = {302--311},
      numpages = {10},
      url = {http://dx.doi.org/10.1007/978-3-642-29740-3_34},
      doi = {10.1007/978-3-642-29740-3_34},
      acmid = {2238474},
      publisher = {Springer-Verlag},
      address = {Berlin, Heidelberg},
      keywords = {checkpoint data compression, checkpoint/restart, extreme scale fault-tolerance}
    }
    
  4. Lindstrom, P., and M. Isenburg. 2006. “Fast and Efficient Compression of Floating-Point Data.” IEEE Transactions on Visualization and Computer Graphics 12 (5): 1245–50.
    @article{LindstromETAl2006,
      author = {Lindstrom, P. and Isenburg, M.},
      journal = {IEEE Transactions on Visualization and Computer Graphics},
      title = {Fast and Efficient Compression of Floating-Point Data},
      year = {2006},
      volume = {12},
      number = {5},
      pages = {1245-1250},
      month = sep
    }