Optimization of Fault-Tolerance Strategies for Workflow Applications

Research topic and goals

In this project, we aim to find efficient fault-tolerant scheduling schemes for workflow applications that can be expressed as a directed acyclic graph (DAG) of tasks.

Checkpoint-recovery is the traditional fault-tolerance technique for large-scale platforms. Unfortunately, as platform scale increases, the Mean Time Between Failures (MTBF) decreases, and checkpoints must become more frequent to compensate. As such, checkpoint-recovery is expected to become a major bottleneck for applications running on post-petascale platforms.
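To make the scaling argument above concrete, here is a minimal sketch based on the classical first-order Young/Daly approximation, which is not stated in this report and is used here only as an illustration. It assumes independent node failures, so that the platform MTBF is the node MTBF divided by the number of nodes. The function names, the 10-year node MTBF and the 600-second checkpoint cost are hypothetical values chosen for the example.

```python
import math

def platform_mtbf(node_mtbf_s: float, num_nodes: int) -> float:
    """Platform MTBF, assuming independent and identical node failures."""
    return node_mtbf_s / num_nodes

def young_daly_period(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order optimal checkpointing period: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

def first_order_overhead(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order fraction of time lost at the optimal period: sqrt(2 * C / MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s / mtbf_s)

if __name__ == "__main__":
    node_mtbf_s = 10 * 365 * 24 * 3600.0   # hypothetical: 10 years per node
    checkpoint_cost_s = 600.0              # hypothetical: 10-minute checkpoint
    for num_nodes in (1_000, 10_000, 100_000):
        mu = platform_mtbf(node_mtbf_s, num_nodes)
        period = young_daly_period(checkpoint_cost_s, mu)
        overhead = first_order_overhead(checkpoint_cost_s, mu)
        print(f"{num_nodes:>7} nodes: MTBF={mu:10.0f}s  period={period:8.0f}s  overhead={overhead:.2%}")
```

Under these assumptions the period shrinks and the overhead grows as the number of nodes increases. The approximation is only meaningful while the checkpoint cost stays small compared with the platform MTBF.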

We first focus on replication as a way of mitigating the checkpoint-recovery overhead. A task can be checkpointed and/or replicated, so that if a single replica fails, no recovery is needed. Our goal is to decide which tasks to checkpoint, which tasks to replicate, and how many resources to allocate to each task for the execution of general workflow applications. To do so, we first need to derive a clear model for replication, as there are many ways to implement it, even for a single task.
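As a rough illustration of why replication reduces the need for recovery, the sketch below estimates the probability that every replica of a task fails. It is not the model developed in this project: it assumes exponentially distributed fail-stop errors, independent replicas, and that recovery is needed only when all replicas fail. The function names and numeric values are hypothetical.

```python
import math

def single_failure_probability(task_length_s: float, mtbf_s: float) -> float:
    """Probability that one replica fails during the task (exponential failures)."""
    return 1.0 - math.exp(-task_length_s / mtbf_s)

def all_replicas_fail_probability(task_length_s: float, mtbf_s: float, replicas: int) -> float:
    """Probability that every replica fails, assuming independent replicas."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return single_failure_probability(task_length_s, mtbf_s) ** replicas

if __name__ == "__main__":
    task_length_s = 3600.0      # hypothetical one-hour task
    mtbf_s = 24 * 3600.0        # hypothetical one-day MTBF for the allocated resources
    for replicas in (1, 2, 3):
        p = all_replicas_fail_probability(task_length_s, mtbf_s, replicas)
        print(f"{replicas} replica(s): probability that recovery is needed = {p:.4f}")
```

The benefit comes at the price of the extra resources used by each replica, which is the trade-off the scheduling decision has to weigh.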

Results for 2016/2017

Work has focused on using replication as a detection and correction mechanism for Silent Data Corruptions (SDC). Although other detection techniques exist for HPC applications, based on algorithms (ABFT), invariant preservation or data analytics, replication remains the most transparent and least intrusive technique. In this work, replication is combined with checkpointing to enable rollback and recovery when forward recovery is not possible, which occurs when too many replicas are corrupted. The goal is to find the right level of replication (duplication, triplication or more) needed to efficiently detect and correct silent errors at scale. As of today, we provide a detailed analytical study with formulas to decide the optimal parameters as a function of the error rate, checkpoint cost, and platform size.
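The difference between detection and correction can be sketched with a simple comparison of replica outputs. The code below is an illustration only and assumes a strict-majority voting rule, which may differ from the exact scheme analyzed in the papers. The function name `vote` and the convention of returning `None` to signal a rollback are hypothetical.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence

def vote(results: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value produced by a strict majority of replicas.

    Returns None when no strict majority exists, meaning forward recovery
    is not possible and the caller should roll back to the last checkpoint.
    """
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None

if __name__ == "__main__":
    # Duplication: a mismatch is detected but cannot be corrected.
    print(vote(["ok", "corrupted"]))
    # Triplication: a single corrupted replica is outvoted.
    print(vote(["ok", "corrupted", "ok"]))
    # Triplication with too many corrupted replicas: rollback is required.
    print(vote(["ok", "bad1", "bad2"]))
```

With two replicas any disagreement forces a rollback, while three replicas can mask one corrupted result, which is why the replication level trades resource cost against the frequency of rollbacks.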

Visits and meetings

Aurélien Cavelan (INRIA) visited Franck Cappello (ANL) in Chicago for three months (March to May 2016) to initiate the project. We are working in close collaboration to make progress.

Impact and publications

Two papers have been accepted at FTXS’17 (Benoit, Cavelan, Cappello, et al. 2017; Benoit, Cavelan, Le Fèvre, et al. 2017).

Future plans

Our work has focused on detecting and correcting silent data corruptions. We first plan to extend our current analytical study to account for both silent and fail-stop errors. Then, work will be directed toward more complex applications such as linear workflows or pipelined applications.

References

    1. Benoit, Anne, Aurélien Cavelan, Franck Cappello, Padma Raghavan, Yves Robert, and Hongyang Sun. 2017. “Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale.” In Proceedings of the 7th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS).
      @inproceedings{benoitEtAl2017identifying,
        title = {Identifying the right replication level to detect and correct silent errors at scale},
        author = {Benoit, Anne and Cavelan, Aur{\'e}lien and Cappello, Franck and Raghavan, Padma and Robert, Yves and Sun, Hongyang},
        year = {2017},
        booktitle = {Proceedings of the 7th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS)},
        keywords = {mine,Workshop}
      }
      
    2. Benoit, Anne, Aurélien Cavelan, Valentin Le Fèvre, and Yves Robert. 2017. “Optimal Checkpointing Period with Replicated Execution on Heterogeneous Platforms.” In Proceedings of the 7th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS).
      @inproceedings{benoitEtAl2017optimal,
        title = {Optimal checkpointing period with replicated execution on heterogeneous platforms},
        author = {Benoit, Anne and Cavelan, Aur{\'e}lien and Le F{\`e}vre, Valentin and Robert, Yves},
        booktitle = {Proceedings of the 7th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS)},
        year = {2017},
        keywords = {mine,Workshop}
      }
      
    3. Casanova, Henri, Yves Robert, Frédéric Vivien, and Dounia Zaidouni. 2015. “On the Impact of Process Replication on Executions of Large-Scale Parallel Applications with Coordinated Checkpointing.” Future Generation Comp. Syst. 51: 7–19. doi:10.1016/j.future.2015.04.003.
      @article{CasanovaEtAl2015,
        author = {Casanova, Henri and Robert, Yves and Vivien, Fr{\'{e}}d{\'{e}}ric and Zaidouni, Dounia},
        title = {On the impact of process replication on executions of large-scale
                       parallel applications with coordinated checkpointing},
        journal = {Future Generation Comp. Syst.},
        volume = {51},
        pages = {7--19},
        year = {2015},
        url = {http://dx.doi.org/10.1016/j.future.2015.04.003},
        doi = {10.1016/j.future.2015.04.003},
        timestamp = {Thu, 31 Mar 2016 15:45:29 +0200},
        biburl = {http://dblp.uni-trier.de/rec/bib/journals/fgcs/CasanovaRVZ15},
        bibsource = {dblp computer science bibliography, http://dblp.org}
      }
      
    4. Herault, Thomas, and Yves Robert. 2015. Fault-Tolerance Techniques for High-Performance Computing. 1st ed. Springer Publishing Company, Incorporated.
      @book{HeraultEtAl2015,
        author = {Herault, Thomas and Robert, Yves},
        title = {Fault-Tolerance Techniques for High-Performance Computing},
        year = {2015},
        isbn = {3319209426, 9783319209425},
        edition = {1st},
        publisher = {Springer Publishing Company, Incorporated}
      }
      
    5. Bougeret, Marin, Henri Casanova, Yves Robert, Frédéric Vivien, and Dounia Zaidouni. 2014. “Using Group Replication for Resilience on Exascale Systems.” IJHPCA 28 (2): 210–24. doi:10.1177/1094342013505348.
      @article{BougeretEtAl2014,
        author = {Bougeret, Marin and Casanova, Henri and Robert, Yves and Vivien, Fr{\'{e}}d{\'{e}}ric and Zaidouni, Dounia},
        title = {Using group replication for resilience on exascale systems},
        journal = {{IJHPCA}},
        volume = {28},
        number = {2},
        pages = {210--224},
        year = {2014},
        url = {http://dx.doi.org/10.1177/1094342013505348},
        doi = {10.1177/1094342013505348},
        timestamp = {Mon, 02 Jun 2014 09:36:01 +0200},
        biburl = {http://dblp.uni-trier.de/rec/bib/journals/ijhpca/BougeretCRVZ14},
        bibsource = {dblp computer science bibliography, http://dblp.org}
      }
      
    6. Ferreira, Kurt B., Jon Stearley, James H. Laros III, Ron Oldfield, Kevin T. Pedretti, Ron Brightwell, Rolf Riesen, Patrick G. Bridges, and Dorian C. Arnold. 2011. “Evaluating the Viability of Process Replication Reliability for Exascale Systems.” In Conference on High Performance Computing Networking, Storage and Analysis, SC 2011, Seattle, WA, USA, November 12-18, 2011, 44:1–44:12. doi:10.1145/2063384.2063443.
      @inproceedings{FerreiraEtAl2011,
        author = {Ferreira, Kurt B. and Stearley, Jon and Laros, III, James H. and Oldfield, Ron and Pedretti, Kevin T. and Brightwell, Ron and Riesen, Rolf and Bridges, Patrick G. and Arnold, Dorian C.},
        title = {Evaluating the viability of process replication reliability for exascale
                       systems},
        booktitle = {Conference on High Performance Computing Networking, Storage and Analysis,
                       {SC} 2011, Seattle, WA, USA, November 12-18, 2011},
        pages = {44:1--44:12},
        year = {2011},
        crossref = {DBLP:conf/sc/2011},
        url = {http://doi.acm.org/10.1145/2063384.2063443},
        doi = {10.1145/2063384.2063443},
        timestamp = {Tue, 30 Jun 2015 16:34:04 +0200},
        biburl = {http://dblp.uni-trier.de/rec/bib/conf/sc/FerreiraSLOPBRBA11},
        bibsource = {dblp computer science bibliography, http://dblp.org}
      }