Checkpoint/Restart of/from lossy state

Research topic and goals

State compression is an important technique for mitigating the bandwidth plateau of HDD-based storage systems.

The state saved in checkpoints by HPC executions consists mostly of floating-point data (single or double precision). Two types of compressors have been developed for floating-point data. Lossless compressors preserve all of the initial information and try to reduce the space it occupies, using techniques such as entropy coding, run-length encoding, and dictionary coding (Lindstrom and Isenburg 2006), (Ibtesham et al. 2012). The compression factor generally observed on scientific data sets with lossless compression is 1.2 to 2 (the compressed data set is 1.2 to 2 times smaller than the original).

Lossy compressors deliberately discard information in order to reduce the size of the initial data set further. They reach highly variable compression ratios depending on the application and on the error bounds set by the user: from less than 10 for difficult-to-compress data sets (on some extreme cases where lossless compressors would achieve a ratio of only 1.5 to 2) up to 100 for easier-to-compress data sets. Lossy compressors may or may not be error bounded. Compressors that are not error bounded simply compress the data set as much as they can, with no guarantee on the error of each decompressed value; their applicability in HPC is limited because users have accuracy expectations. Error-bounded lossy compressors provide different error controls, such as absolute and relative error bounds, and some also provide statistical bounds (SZ, for example, can guarantee a lower bound on the PSNR). The user thus has a guarantee that the loss of information is quantified by a maximum compression/decompression error, and can set the error bound to match their accuracy expectations. Note that all lossy compressors keep every data point of the initial data set: none of them drop data points.
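
As a concrete illustration of an absolute error bound, the sketch below implements the simplest possible error-bounded scheme, uniform scalar quantization. The helper names are hypothetical; real compressors such as SZ add prediction and entropy coding around such a quantization step, so this illustrates the guarantee, not any particular compressor.

    import numpy as np

    def compress(data, abs_err):
        # Quantize each value to the nearest multiple of 2*abs_err, which
        # guarantees |x - decompress(compress(x))| <= abs_err pointwise.
        # A real compressor would entropy-code the integer codes.
        return np.round(data / (2.0 * abs_err)).astype(np.int64)

    def decompress(codes, abs_err):
        return codes * (2.0 * abs_err)

    rng = np.random.default_rng(0)
    field = np.cumsum(rng.normal(size=1000))  # a smooth-ish 1D field
    eps = 1e-3                                # user-chosen absolute error bound
    restored = decompress(compress(field, eps), eps)
    assert np.max(np.abs(field - restored)) <= eps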

Few studies have considered checkpointing and restart from lossy compressed states. Those that exist are limited in scope, each studying a single application and a single type of lossy compressor: NICAM with a lossy compressor based on wavelets and quantization (Sasaki et al. 2015), and ChaNGa with fpzip used as a lossy compressor (Ni et al. 2014). They already reveal two interesting points. First, checkpoints are composed of different variables that present different sensitivities to lossy compression, and the correctness of the execution after restart depends on how each variable is compressed. In the cosmology simulation (ChaNGa), lossy compression of the particle positions causes the execution to hang at a high compression level, while compressing other variables at the same level does not. Second, in the NICAM study, the authors consider an error of 1 percent on the final result acceptable when restarting from lossy checkpoints; the rationale is that this magnitude of error is similar to that of sensor and model errors, while the compression factor exceeds 5.

In contrast to previous research, which concentrated on a few applications, we focus on simple problems used by many applications and try to understand how they behave. We are exploring diffusion and advection problems: the diffusion problem simulates heat diffusion on a one-dimensional rod, and the advection problem simulates a sine wave advecting to the right with periodic boundaries.
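
For reference, here is a minimal sketch of the two model problems, assuming a first-order upwind discretization for advection and an explicit (FTCS) discretization for diffusion; the discretizations used in our experiments may differ.

    import numpy as np

    def diffusion_step(u, alpha, dx, dt):
        # One explicit (FTCS) step of u_t = alpha * u_xx on a 1D rod with
        # fixed ends; stable for alpha * dt / dx**2 <= 0.5.
        un = u.copy()
        un[1:-1] += alpha * dt / dx**2 * (u[2:] - 2.0 * u[1:-1] + u[:-2])
        return un

    def advection_step(u, c, dx, dt):
        # One first-order upwind step of u_t + c * u_x = 0 (c > 0) with
        # periodic boundaries; stable for the CFL condition c * dt / dx <= 1.
        return u - c * dt / dx * (u - np.roll(u, 1))

    nx = 200
    dx = 1.0 / nx
    x = np.arange(nx) * dx
    u_adv = np.sin(2.0 * np.pi * x)            # sine wave advecting right
    u_dif = np.exp(-100.0 * (x - 0.5) ** 2)    # heat pulse on the rod
    for _ in range(100):
        u_adv = advection_step(u_adv, c=1.0, dx=dx, dt=0.4 * dx)
        u_dif = diffusion_step(u_dif, alpha=1.0, dx=dx, dt=0.4 * dx**2)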

Results for 2015/2016

Our observations of restart from lossy checkpoints in dynamic simulations show that the diffusion and advection problems react differently, and we are currently characterizing and formalizing this difference. We suspect that the tolerable lossiness increases as the execution progresses: restarting from a lossy checkpoint near the beginning of the execution seems to affect the end result much more than restarting from a lossy checkpoint near the end of the execution. Moreover, we note that within a given variable field, some regions are more chaotic than others. Some applications may not tolerate lossy compression on the entire field, but we suspect that they may tolerate it in the nonchaotic regions of the field.
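
The kind of experiment behind these observations can be sketched as follows, with illustrative parameters and plain quantization standing in for a real error-bounded compressor: take a lossy checkpoint at step k of an advection run, restart from it, and measure how far the final solution deviates from an uncompressed reference run as k varies.

    import numpy as np

    def step(u, cfl=0.4):
        # First-order upwind advection step with periodic boundaries.
        return u - cfl * (u - np.roll(u, 1))

    def lossy(u, eps):
        # Absolute-error-bounded quantization standing in for a compressor.
        return np.round(u / (2.0 * eps)) * (2.0 * eps)

    nx, steps, eps = 200, 400, 1e-2
    u0 = np.sin(2.0 * np.pi * np.arange(nx) / nx)

    ref = u0.copy()                    # reference run without any restart
    for _ in range(steps):
        ref = step(ref)

    # Take a lossy checkpoint at step k, restart from it, and compare the
    # final state with the reference; sweeping k probes how the checkpoint
    # time influences the end result.
    for k in (50, 200, 350):
        u = u0.copy()
        for _ in range(k):
            u = step(u)
        u = lossy(u, eps)              # checkpoint + restart from lossy state
        for _ in range(steps - k):
            u = step(u)
        print(k, np.max(np.abs(u - ref)))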

Results for 2016/2017

In particular, we explored an important problem: how to bound the impact of restarting from a lossy checkpoint and guarantee that this impact does not affect the quality of the application's results. To address this problem, we established a link between the compression error and the numerical error of the application. Applications using numerical methods already suffer truncation and discretization errors. We showed that compressing checkpoints with an error lower than these numerical errors preserves the quality of the application results, and we demonstrated empirically that the error introduced by restarting from a lossy checkpoint can be bounded.
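
The precise criterion is developed in (Calhoun et al. 2018). The sketch below only illustrates the idea on the upwind advection problem from above: estimate the discretization error against the known exact solution, then set the compressor's absolute error bound safely below it (the 0.1 safety factor is an arbitrary illustrative choice, not the paper's).

    import numpy as np

    def advect(nx, t_final, cfl=0.5):
        # Upwind advection of sin(2*pi*x) at unit speed on an nx-point
        # periodic grid, advanced up to t_final.
        x = np.arange(nx) / nx
        u = np.sin(2.0 * np.pi * x)
        dt = cfl / nx                  # dx = 1/nx, c = 1
        for _ in range(int(round(t_final / dt))):
            u = u - cfl * (u - np.roll(u, 1))
        return x, u

    x, u = advect(nx=200, t_final=1.0)
    exact = np.sin(2.0 * np.pi * (x - 1.0))  # exact solution after one period
    disc_err = np.max(np.abs(u - exact))     # discretization error of the scheme

    # Guidance: pick the checkpoint compression bound safely below the
    # numerical error, so the loss is hidden under errors already present.
    eps = 0.1 * disc_err
    print(f"discretization error {disc_err:.2e} -> compression bound {eps:.2e}")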

Other researchers will be able to exploit the link we established between the compression error and the numerical errors to design better compression algorithms, as well as numerical methods that better tolerate compression errors.

An important impact of this work on other disciplines that use numerical simulation is that they can now adopt lossy compression for checkpoint/restart, since we established and verified guidance for setting the compression error bound so that the quality of the numerical results is guaranteed.

Jon Calhoun wrote and defended a Ph.D. dissertation presenting this research and its results.

Results for 2017/2018

The paper on the results of this research submitted to IJHPCA has been accepted for publication.

Visits and meetings

Franck Cappello visits UIUC almost every week, and we meet for 30 minutes to an hour almost every visit. Jon Calhoun completed an 11-week internship at ANL.

Impact and publications

This research continues at Argonne National Laboratory, focusing on restart from lossy checkpointing for iterative numerical methods in linear algebra; a paper has been submitted to a top ACM conference.

The results of this project motivated the submission of the NSF Aletheia project, which has been awarded and is funded for 3 years.

Funded by the NSF Aletheia project, a Ph.D. student (Wang Chen) at UIUC is exploring how to detect corruption in lossy compressed results (e.g., checkpoints) of numerical simulations.

See (Calhoun et al. 2018) and (Calhoun 2017).

  1. Calhoun, Jon, Franck Cappello, Luke N. Olson, Marc Snir, and William D. Gropp. 2018. “Exploring the Feasibility of Lossy Compression for PDE Simulations.” Int. J. High Perform. Comput. Appl. 27. Sage Publications, Inc.
    @article{Calhoun18,
      author = {Calhoun, Jon and Cappello, Franck and Olson, Luke N. and Snir, Marc and Gropp, William D.},
      journal = {Int. J. High Perform. Comput. Appl.},
      publisher = {Sage Publications, Inc.},
      title = {Exploring the Feasibility of Lossy Compression for PDE Simulations},
      volume = {27},
      year = {2018}
    }
    
  2. Calhoun, Jon. 2017. “From Detection to Optimization: Impact of Soft Errors on High-Performance Computing Applications.” Ph.D. dissertation: https://www.ideals.illinois.edu/handle/2142/98379.
    @unpublished{Calhoun17,
      author = {Calhoun, Jon},
      note = {Ph.D. dissertation: https://www.ideals.illinois.edu/handle/2142/98379},
      title = {From detection to optimization: impact of soft errors on high-performance computing applications},
      year = {2017}
    }
    

Future plans

We still need to establish a formal link between the numerical errors and the compression error.

References

  1. Sasaki, N., K. Sato, T. Endo, and S. Matsuoka. 2015. “Exploration of Lossy Compression for Application-Level Checkpoint/Restart.” In Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International, 914–22.
    @inproceedings{SasakiETAl2015,
      author = {Sasaki, N. and Sato, K. and Endo, T. and Matsuoka, S.},
      booktitle = {Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International},
      title = {Exploration of Lossy Compression for Application-Level Checkpoint/Restart},
      year = {2015},
      pages = {914-922},
      month = may
    }
    
  2. Ni, Xiang, Tanzima Islam, Kathryn Mohror, Adam Moody, and Laxmikant V Kale. 2014. “Lossy Compression for Checkpointing: Fallible or Feasible?” In International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
    @inproceedings{NiETAl2014,
      title = {Lossy compression for checkpointing: Fallible or feasible?},
      author = {Ni, Xiang and Islam, Tanzima and Mohror, Kathryn and Moody, Adam and Kale, Laxmikant V},
      booktitle = {International Conference for High Performance Computing, Networking, Storage and Analysis (SC)},
      year = {2014}
    }
    
  3. Ibtesham, Dewan, Dorian Arnold, Kurt B. Ferreira, and Patrick G. Bridges. 2012. “On the Viability of Checkpoint Compression for Extreme Scale Fault Tolerance.” In Proceedings of the 2011 International Conference on Parallel Processing - Volume 2, 302–11. Euro-Par’11. Berlin, Heidelberg: Springer-Verlag. doi:10.1007/978-3-642-29740-3_34.
    @inproceedings{IbteshamETAl2012,
      author = {Ibtesham, Dewan and Arnold, Dorian and Ferreira, Kurt B. and Bridges, Patrick G.},
      title = {On the Viability of Checkpoint Compression for Extreme Scale Fault Tolerance},
      booktitle = {Proceedings of the 2011 International Conference on Parallel Processing - Volume 2},
      series = {Euro-Par'11},
      year = {2012},
      isbn = {978-3-642-29739-7},
      location = {Bordeaux, France},
      pages = {302--311},
      numpages = {10},
      url = {http://dx.doi.org/10.1007/978-3-642-29740-3_34},
      doi = {10.1007/978-3-642-29740-3_34},
      acmid = {2238474},
      publisher = {Springer-Verlag},
      address = {Berlin, Heidelberg},
      keywords = {checkpoint data compression, checkpoint/restart, extreme scale fault-tolerance}
    }
    
  4. Lindstrom, P., and M. Isenburg. 2006. “Fast and Efficient Compression of Floating-Point Data.” IEEE Transactions on Visualization and Computer Graphics 12 (5): 1245–50.
    @article{LindstromETAl2006,
      author = {Lindstrom, P. and Isenburg, M.},
      journal = {IEEE Transactions on Visualization and Computer Graphics},
      title = {Fast and Efficient Compression of Floating-Point Data},
      year = {2006},
      volume = {12},
      number = {5},
      pages = {1245-1250},
      month = sep
    }