Checkpoint/Restart of/from lossy state

Research topic and goals

State compression is an important technique for mitigating the problem posed by the plateauing bandwidth of HDD-based storage systems.

The state saved by HPC executions in checkpoints is composed mostly of floating-point data (single or double precision). Two types of compressors have been developed for floating-point data. Lossless compressors keep all the initial information and try to reduce the space it occupies. Sophisticated lossless compression techniques may use, for example, entropy analysis, duplicate string elimination, or bit reduction (Lindstrom and Isenburg 2006), (Ibtesham et al. 2012). The compression factor generally observed with lossless compression is 1.2 to 4 (the compressed data set is 1.2 to 4 times smaller than the original). Lossy compressors deliberately lose information in order to reduce the size of the initial data set further. The compression ratios they reach vary widely with the application: from 3 to 4 for difficult-to-compress data sets (on which lossless compressors would achieve a compression ratio of only 1.5 to 2) up to 100 for easier-to-compress data sets. Lossy compressors may or may not be error bounded. Lossy compressors that are not error bounded simply compress the data set as much as they can, with no guarantee on the error of any individual decompressed value. Their applicability in HPC is limited because users have expectations in terms of accuracy. Error-bounded lossy compressors allow users to set a maximum absolute (or relative) compression/decompression error. The maximum error is the largest difference between any value in the initial data set and its decompressed version (from its lossy compressed version). The user thus has a guarantee that the loss of information is quantified by the maximum compression/decompression error and will set this bound to match their accuracy expectations. Note that all lossy compressors keep all data points of the initial data set (none of them drops data points).
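
To make the error-bound guarantee concrete, the Python sketch below shows what the user-facing contract looks like: every decompressed value stays within a user-chosen absolute error of the original. The quantization scheme is purely illustrative (it is not the algorithm of fpzip or of any particular compressor), and the variable names are our own.

    import numpy as np

    def lossy_compress(values, abs_error):
        # Map each value to the index of a bin of width 2*abs_error; the
        # round-trip error is therefore at most abs_error for every data point.
        return np.round(values / (2.0 * abs_error)).astype(np.int64)

    def lossy_decompress(codes, abs_error):
        # Reconstruct the bin centers.
        return codes * (2.0 * abs_error)

    data = np.random.rand(1_000_000)   # synthetic floating-point checkpoint field
    bound = 1e-4                       # user-chosen maximum absolute error
    restored = lossy_decompress(lossy_compress(data, bound), bound)

    # The pointwise guarantee (tiny slack only for floating-point rounding).
    assert np.max(np.abs(data - restored)) <= bound + 1e-12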

Few studies have considered checkpointing and restarting from lossy compressed states (Sasaki et al. 2015), (Ni et al. 2014). These studies are limited in scope: each examines only one application and one type of lossy compressor, namely NICAM with a lossy compressor based on wavelets and quantization (Sasaki et al. 2015) and ChaNGa with the fpzip lossy compressor (Ni et al. 2014). They already reveal two interesting points. First, checkpoints are composed of different variables that present different sensitivities to lossy compression. The correctness of the execution after restart depends on how each variable is compressed. In the cosmology simulation (ChaNGa), lossy compression of particle positions leads the execution to hang at a high compression level, while this does not happen when compressing other variables at the same level. Second, for the study of NICAM, the authors consider an error of 1 percent on the final result acceptable when restarting from lossy checkpoints. The rationale is that this magnitude of error is similar to that of sensor errors and model errors, while the compression factor exceeds 5.

In contrast to previous research, which concentrated on a few applications, we focus on simple problems that underlie many applications and try to understand how they behave. We are exploring diffusion and advection problems. The diffusion problem simulates heat diffusion on a one-dimensional rod. The advection problem simulates a sine wave advecting to the right with periodic boundaries.
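
For reference, a minimal Python sketch of the two test problems is given below. The discretizations (an explicit centered scheme for diffusion and a first-order upwind scheme for advection), the grid size, and the time steps are illustrative choices, not necessarily those used in our experiments; the diffusion stencil also uses periodic wrapping here purely to keep the example short.

    import numpy as np

    n = 256
    x = np.linspace(0.0, 1.0, n, endpoint=False)
    dx = x[1] - x[0]

    def diffusion_step(u, alpha, dt):
        # Explicit centered second difference for 1-D heat diffusion on the rod
        # (periodic wrap used here only for brevity).
        lap = np.roll(u, -1) - 2.0 * u + np.roll(u, 1)
        return u + alpha * dt / dx**2 * lap

    def advection_step(u, c, dt):
        # First-order upwind step: the sine wave advects to the right
        # with periodic boundaries.
        return u - c * dt / dx * (u - np.roll(u, 1))

    u_diff = np.exp(-100.0 * (x - 0.5)**2)   # initial heat pulse
    u_adv = np.sin(2.0 * np.pi * x)          # initial sine wave
    for _ in range(1000):
        u_diff = diffusion_step(u_diff, alpha=1.0, dt=0.25 * dx**2)
        u_adv = advection_step(u_adv, c=1.0, dt=0.5 * dx)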

Results for 2015/2016

From our experiments on restarting from lossy checkpoints in dynamic simulations, we have observed that diffusion and advection problems react differently. We are currently characterizing and formalizing this difference. We suspect that the error introduced by a lossy restart may be amplified as the execution progresses: restarting from a lossy checkpoint near the beginning of the execution seems to affect the end result much more than restarting from a lossy checkpoint near the end of the execution. Moreover, we note that within a given variable field, some regions are more chaotic than others. Some applications may not tolerate lossy compression on the entire field, but we suspect that they may tolerate lossy compression in the nonchaotic regions of the field.
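
A sketch of the kind of experiment behind these observations is shown below. The helper names and the quantization stand-in for an error-bounded compressor are hypothetical; the step function is any single-step update such as the ones sketched above. The measured quantity is the deviation of the final state as a function of where in the execution the lossy checkpoint is taken.

    import numpy as np

    def run(u0, steps, step_fn):
        # Advance the state for a given number of steps.
        u = u0.copy()
        for _ in range(steps):
            u = step_fn(u)
        return u

    def lossy_round_trip(u, abs_error):
        # Stand-in for an error-bounded compress/decompress cycle.
        return np.round(u / (2.0 * abs_error)) * (2.0 * abs_error)

    def restart_deviation(u0, total_steps, ckpt_step, abs_error, step_fn):
        # Deviation of the final state when restarting from a lossy checkpoint
        # taken at ckpt_step, relative to an uninterrupted (lossless) run.
        reference = run(u0, total_steps, step_fn)
        checkpoint = lossy_round_trip(run(u0, ckpt_step, step_fn), abs_error)
        restarted = run(checkpoint, total_steps - ckpt_step, step_fn)
        return np.max(np.abs(reference - restarted))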

Results for 2016/2017

In particular, we explored an important problem: how to bound the impact of restarting from a lossy checkpoint and guarantee that this impact does not affect the quality of the application results. To address this problem, we established a link between the compression error and the numerical error of the application. Applications using numerical methods suffer from truncation and discretization errors. We showed that compressing checkpoints with an error lower than these numerical errors allows the quality of the application results to be preserved. We also demonstrated empirically that the error introduced by restarting from a lossy checkpoint can be bounded.
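
The sketch below illustrates this guidance under the assumption of a second-order-accurate spatial discretization: estimate the discretization error and set the checkpoint compression bound strictly below it, so that the information lost by compression stays beneath the error the numerical method already commits. The helper names and the safety factor are illustrative choices, not the exact criterion of the submitted paper.

    import numpy as np

    def discretization_error_proxy(u, dx):
        # Crude O(dx^2) truncation-error proxy for a second-order scheme
        # (an assumption made for illustration; the actual estimate depends
        # on the numerical method and on the problem).
        u_xx = (np.roll(u, -1) - 2.0 * u + np.roll(u, 1)) / dx**2
        return dx**2 * np.max(np.abs(u_xx))

    def checkpoint_error_bound(u, dx, safety=0.1):
        # Set the lossy-compression error bound a comfortable margin below
        # the estimated numerical error of the simulation state.
        return safety * discretization_error_proxy(u, dx)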

Other researchers will be able to exploit the link we established between the compression error and the numerical errors to design better compression algorithms, as well as numerical methods that better tolerate compression errors.

An important impact of this work on other disciplines that use numerical simulation is that they can now use lossy compression for checkpoint/restart, since we established and verified guidance for setting the compression error bound so as to guarantee the quality of the numerical results.

Visits and meetings

Franck Cappello visits UIUC almost every week, and we have a 30-minute to 1-hour meeting almost each time. Jon Calhoun did an 11-week internship at ANL.

Impact and publications

See (Calhoun et al. 2017).

  1. Calhoun, Jon, Franck Cappello, Luke Olson, Marc Snir, and William Gropp. 2017. “Exploring the Feasibility of Lossy Compression for PDE Simulations.” Submitted.
    @unpublished{Calhoun17,
      author = {Calhoun, Jon and Cappello, Franck and Olson, Luke and Snir, Marc and Gropp, William},
      note = {},
      numpages = {12},
      title = {Exploring the Feasibility of Lossy Compression for PDE Simulations},
      year = {(Submitted) 2017}
    }
    

Future plans

We still need to establish a formal link between the numerical errors and the compression errors.

References

  1. Sasaki, N., K. Sato, T. Endo, and S. Matsuoka. 2015. “Exploration of Lossy Compression for Application-Level Checkpoint/Restart.” In Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International, 914–22.
    @inproceedings{SasakiETAl2015,
      author = {Sasaki, N. and Sato, K. and Endo, T. and Matsuoka, S.},
      booktitle = {Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International},
      title = {Exploration of Lossy Compression for Application-Level Checkpoint/Restart},
      year = {2015},
      pages = {914-922},
      month = may
    }
    
  2. Ni, Xiang, Tanzima Islam, Kathryn Mohror, Adam Moody, and Laxmikant V. Kale. 2014. “Lossy Compression for Checkpointing: Fallible or Feasible?” In International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
    @inproceedings{NiETAl2014,
      title = {Lossy compression for checkpointing: Fallible or feasible?},
      author = {Ni, Xiang and Islam, Tanzima and Mohror, Kathryn and Moody, Adam and Kale, Laxmikant V},
      booktitle = {International Conference for High Performance Computing, Networking, Storage and Analysis (SC)},
      year = {2014}
    }
    
  3. Ibtesham, Dewan, Dorian Arnold, Kurt B. Ferreira, and Patrick G. Bridges. 2012. “On the Viability of Checkpoint Compression for Extreme Scale Fault Tolerance.” In Proceedings of the 2011 International Conference on Parallel Processing - Volume 2, 302–11. Euro-Par’11. Berlin, Heidelberg: Springer-Verlag. doi:10.1007/978-3-642-29740-3_34.
    @inproceedings{IbteshamETAl2012,
      author = {Ibtesham, Dewan and Arnold, Dorian and Ferreira, Kurt B. and Bridges, Patrick G.},
      title = {On the Viability of Checkpoint Compression for Extreme Scale Fault Tolerance},
      booktitle = {Proceedings of the 2011 International Conference on Parallel Processing - Volume 2},
      series = {Euro-Par'11},
      year = {2012},
      isbn = {978-3-642-29739-7},
      location = {Bordeaux, France},
      pages = {302--311},
      numpages = {10},
      url = {http://dx.doi.org/10.1007/978-3-642-29740-3_34},
      doi = {10.1007/978-3-642-29740-3_34},
      acmid = {2238474},
      publisher = {Springer-Verlag},
      address = {Berlin, Heidelberg},
      keywords = {checkpoint data compression, checkpoint/restart, extreme scale fault-tolerance}
    }
    
  4. Lindstrom, P., and M. Isenburg. 2006. “Fast and Efficient Compression of Floating-Point Data.” IEEE Transactions on Visualization and Computer Graphics 12 (5): 1245–50.
    @article{LindstromETAl2006,
      author = {Lindstrom, P. and Isenburg, M.},
      journal = {IEEE Transactions on Visualization and Computer Graphics},
      title = {Fast and Efficient Compression of Floating-Point Data},
      year = {2006},
      volume = {12},
      number = {5},
      pages = {1245-1250},
      month = sep
    }