Optimization of Fault-Tolerance Strategies for Workflow Applications

Research topic and goals

In this project, we aim to find efficient fault-tolerant scheduling schemes for workflow applications that can be expressed as a directed acyclic graph (DAG) of tasks.

Checkpoint-recovery is the traditional fault-tolerance technique for large-scale platforms. Unfortunately, as platform scale increases, the platform Mean Time Between Failures (MTBF) decreases, and checkpoints must become more frequent to cope with it. Checkpoint-recovery is therefore expected to become a major bottleneck for applications running on post-petascale platforms.
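To illustrate why checkpoints become more frequent at scale, here is a minimal sketch based on the classical first-order Young/Daly approximation of the optimal checkpointing period, sqrt(2 * C * MTBF). It assumes independent, identically distributed node failures, so that the platform MTBF is the node MTBF divided by the number of nodes. The function names and the numeric values are hypothetical, and this is not necessarily the model used in this project.

```python
import math


def platform_mtbf(node_mtbf_s: float, num_nodes: int) -> float:
    """Platform MTBF, assuming independent and identical node failures."""
    return node_mtbf_s / num_nodes


def young_daly_period(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order approximation of the optimal checkpointing period."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


if __name__ == "__main__":
    node_mtbf = 10 * 365 * 24 * 3600.0  # hypothetical: 10 years per node
    checkpoint_cost = 600.0             # hypothetical: 10-minute checkpoint
    for nodes in (1_000, 10_000, 100_000):
        mu = platform_mtbf(node_mtbf, nodes)
        period = young_daly_period(checkpoint_cost, mu)
        print(f"{nodes:>7} nodes: MTBF {mu / 3600:.1f} h, period {period / 3600:.2f} h")
```

Under these assumptions, multiplying the number of nodes by 100 divides the checkpointing period by 10, while the cost of each checkpoint stays the same.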

We first focus on replication as a way of mitigating the checkpoint-recovery overhead. A task can be checkpointed and/or replicated, so that if a single replica fails, no recovery is needed. Our goal is to decide which tasks to checkpoint, which tasks to replicate, and how many resources to allocate to each task for the execution of general workflow applications. To do so, we first need to derive a clear model for replication, as there are many ways to implement it, even for a single task.

Results for 2016/2017

The initial work for this project focused on using replication as a detection and correction mechanism for Silent Data Corruptions (SDC). Although other detection techniques exist for HPC applications, based on algorithms (ABFT), invariant preservation or data analytics, replication remains the most transparent and least intrusive technique.

In this project, replication is combined with checkpointing to enable rollback and recovery when forward recovery is not possible, which occurs when too many replicas are corrupted. The goal is to find the right level of replication (duplication, triplication or more) needed to efficiently detect and correct silent errors at scale. We have provided a detailed analytical study for this framework.

Results for 2017/2018

We have extended these results to platforms subject to both silent and fail-stop errors. Unlike silent errors, fail-stop errors are detected immediately, and replication may also help tolerate them.

We have considered two flavors of replication: process replication and group replication. Process replication applies to message-passing applications with communicating processes. Each process is replicated, and the platform is composed of process pairs or triplets. Group replication applies to black-box applications, whose parallel execution is replicated several times. The platform is partitioned into two halves (or three thirds). In both scenarios, results are compared before each checkpoint, which is taken only when both results (duplication) or two out of three results (triplication) coincide. Otherwise, one or more silent errors have been detected, and the application rolls back to the last checkpoint. The application also rolls back when fail-stop errors have struck.
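The comparison step described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `compare_replicas` is hypothetical, and each replica result is assumed to be summarized by a hashable value such as a checksum string.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence, Tuple


def compare_replicas(results: Sequence[Hashable]) -> Tuple[bool, Optional[Hashable]]:
    """Decide whether a checkpoint can be taken.

    Duplication (2 results): both results must coincide.
    Triplication (3 results): at least two of the three must coincide.
    Returns (True, agreed_value) if a checkpoint can be taken,
    or (False, None) if the application must roll back.
    """
    if len(results) not in (2, 3):
        raise ValueError("this sketch handles duplication and triplication only")
    value, count = Counter(results).most_common(1)[0]
    if count >= 2:
        return True, value
    return False, None


if __name__ == "__main__":
    # Triplication: one corrupted replica is outvoted, no rollback needed.
    print(compare_replicas(["c0ffee", "c0ffee", "deadbeef"]))
    # Duplication: a mismatch only detects the error, so a rollback is needed.
    print(compare_replicas(["c0ffee", "deadbeef"]))
```

With duplication a mismatch can only be detected, whereas with triplication a single corrupted result can be corrected by the majority.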

We provide a detailed analytical study for all of these scenarios, with formulas to decide, for each scenario, the optimal parameters as a function of the error rate, checkpoint cost, and platform size. We also report an extensive set of simulation results that corroborate the analytical model.

The previous work focused on applications that can be checkpointed at any chosen frequency, where the aim is to find the optimal checkpointing period. We have also initiated a study of linear chains of parallel tasks, where the aim is instead to decide which tasks to checkpoint and/or replicate. For this case, we have provided an optimal dynamic programming algorithm and an extensive set of simulations to assess (i) in which scenarios checkpointing performs better than replication, or vice versa; and (ii) in which scenarios the combination of both methods is useful, and to what extent.
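To give an idea of how a dynamic program over a linear chain works, here is a simplified sketch that decides only where to checkpoint. It does not handle replication and is not the algorithm developed in this project. It assumes exponentially distributed fail-stop errors of rate `lam`, a uniform checkpoint cost, a uniform recovery cost (also paid when restarting from the initial state), and a mandatory checkpoint after the last task. The expected time of a segment uses the classical closed form e^{lam*R} (1/lam + D) (e^{lam*(W+C)} - 1). All names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def expected_time(work: float, ckpt: float, rec: float, downtime: float, lam: float) -> float:
    """Expected time to execute `work` followed by a checkpoint of cost `ckpt`,
    with a downtime and a recovery of cost `rec` after each failure."""
    return math.exp(lam * rec) * (1.0 / lam + downtime) * math.expm1(lam * (work + ckpt))


def place_checkpoints(weights: Sequence[float], ckpt: float, rec: float,
                      downtime: float, lam: float) -> Tuple[float, List[int]]:
    """Return (expected makespan, 1-based indices of checkpointed tasks)."""
    n = len(weights)
    prefix = [0.0]
    for w in weights:
        prefix.append(prefix[-1] + w)
    best = [0.0] + [math.inf] * n  # best[j]: optimal cost if task j is checkpointed
    prev = [0] * (n + 1)           # prev[j]: previous checkpointed task (0 = start)
    for j in range(1, n + 1):
        for i in range(j):
            # Tasks i+1..j form one segment, checkpointed after task j.
            segment = expected_time(prefix[j] - prefix[i], ckpt, rec, downtime, lam)
            if best[i] + segment < best[j]:
                best[j] = best[i] + segment
                prev[j] = i
    checkpoints = []
    j = n
    while j > 0:  # walk back through the chosen segment boundaries
        checkpoints.append(j)
        j = prev[j]
    return best[n], sorted(checkpoints)


if __name__ == "__main__":
    total, where = place_checkpoints([10.0, 10.0, 10.0], ckpt=0.1, rec=0.1,
                                     downtime=0.0, lam=0.1)
    print(total, where)
```

The double loop runs in O(n^2) time for a chain of n tasks. Extending such a scheme to also choose which tasks to replicate requires a richer state, which this sketch deliberately leaves out.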

Visits and meetings

Aurélien Cavelan (INRIA) visited Franck Cappello (ANL) in Chicago for three months (March, April, and May 2016) to initiate the project. Furthermore, we have been meeting regularly over the past years. In particular, we attended the SC conference (November 2016 and November 2017), where we had extensive discussions to make progress. We represented the JLESC at the Inria booth during these conferences.

While not meeting in person, we have stayed in close collaboration through regular Skype meetings, which allowed us to make progress on the project.

Impact and publications

Two papers have been accepted to FTXS’17 (Benoit, Cavelan, Cappello, et al. 2017; Benoit, Cavelan, Le Fèvre, et al. 2017).

The most recent work combining fail-stop and silent errors has been submitted to JPDC (“Coping with silent and fail-stop errors at scale by combining replication and checkpointing”).

The initial work on linear chains of tasks will be submitted to APDCM’18 (“Combining Checkpointing and Replication for Linear Workflows”).

Future plans

Much remains to be explored for workflow applications. We have so far focused only on duplication in this case, but one may want to assign different replication levels (duplication, triplication or more) to different tasks, depending upon their criticality in terms of longest paths, number of successors, etc. This may be even more important when considering a general directed acyclic graph of tasks, rather than restricting to linear chains of tasks. This topic is called partial replication. Even though it has been studied empirically in previous work, designing an optimal strategy that combines partial redundancy and checkpointing, and analyzing its efficacy, remains to be done.

Also, we have not yet explored how replication may help correct silent data corruptions in workflow applications, since our initial study considers only fail-stop errors. Combining both types of errors for workflow applications is a challenging direction for future work.

Finally, our initial goal was to target pipelined workflow applications, where data continuously enters the workflow, and where the objective is to maximize the achievable throughput. This raises several new challenges that we hope to address in the future.

References

  1. Benoit, Anne, Aurélien Cavelan, Franck Cappello, Padma Raghavan, Yves Robert, and Hongyang Sun. 2017. “Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale.” In Proceedings of the 7th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS).
    @inproceedings{benoitEtAl2017identifying,
      title = {Identifying the right replication level to detect and correct silent errors at scale},
      author = {Benoit, Anne and Cavelan, Aur{\'e}lien and Cappello, Franck and Raghavan, Padma and Robert, Yves and Sun, Hongyang},
      year = {2017},
      booktitle = {Proceedings of the 7th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS)},
      keywords = {mine,Workshop}
    }
    
  2. Benoit, Anne, Aurélien Cavelan, Valentin Le Fèvre, and Yves Robert. 2017. “Optimal Checkpointing Period with Replicated Execution on Heterogeneous Platforms.” In Proceedings of the 7th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS).
    @inproceedings{benoitEtAl2017optimal,
      title = {Optimal checkpointing period with replicated execution on heterogeneous platforms},
      author = {Benoit, Anne and Cavelan, Aur{\'e}lien and Le F{\`e}vre, Valentin and Robert, Yves},
      booktitle = {Proceedings of the 7th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS)},
      year = {2017},
      keywords = {mine,Workshop}
    }
    
  3. Casanova, Henri, Yves Robert, Frédéric Vivien, and Dounia Zaidouni. 2015. “On the Impact of Process Replication on Executions of Large-Scale Parallel Applications with Coordinated Checkpointing.” Future Generation Comp. Syst. 51: 7–19. doi:10.1016/j.future.2015.04.003.
    @article{CasanovaEtAl2015,
      author = {Casanova, Henri and Robert, Yves and Vivien, Fr{\'{e}}d{\'{e}}ric and Zaidouni, Dounia},
      title = {On the impact of process replication on executions of large-scale
                     parallel applications with coordinated checkpointing},
      journal = {Future Generation Comp. Syst.},
      volume = {51},
      pages = {7--19},
      year = {2015},
      url = {http://dx.doi.org/10.1016/j.future.2015.04.003},
      doi = {10.1016/j.future.2015.04.003},
      timestamp = {Thu, 31 Mar 2016 15:45:29 +0200},
      biburl = {http://dblp.uni-trier.de/rec/bib/journals/fgcs/CasanovaRVZ15},
      bibsource = {dblp computer science bibliography, http://dblp.org}
    }
    
  4. Herault, Thomas, and Yves Robert. 2015. Fault-Tolerance Techniques for High-Performance Computing. 1st ed. Springer Publishing Company, Incorporated.
    @book{HeraultEtAl2015,
      author = {Herault, Thomas and Robert, Yves},
      title = {Fault-Tolerance Techniques for High-Performance Computing},
      year = {2015},
      isbn = {3319209426, 9783319209425},
      edition = {1st},
      publisher = {Springer Publishing Company, Incorporated}
    }
    
  5. Bougeret, Marin, Henri Casanova, Yves Robert, Frédéric Vivien, and Dounia Zaidouni. 2014. “Using Group Replication for Resilience on Exascale Systems.” IJHPCA 28 (2): 210–24. doi:10.1177/1094342013505348.
    @article{BougeretEtAl2014,
      author = {Bougeret, Marin and Casanova, Henri and Robert, Yves and Vivien, Fr{\'{e}}d{\'{e}}ric and Zaidouni, Dounia},
      title = {Using group replication for resilience on exascale systems},
      journal = {{IJHPCA}},
      volume = {28},
      number = {2},
      pages = {210--224},
      year = {2014},
      url = {http://dx.doi.org/10.1177/1094342013505348},
      doi = {10.1177/1094342013505348},
      timestamp = {Mon, 02 Jun 2014 09:36:01 +0200},
      biburl = {http://dblp.uni-trier.de/rec/bib/journals/ijhpca/BougeretCRVZ14},
      bibsource = {dblp computer science bibliography, http://dblp.org}
    }
    
  6. Ferreira, Kurt B., Jon Stearley, James H. Laros III, Ron Oldfield, Kevin T. Pedretti, Ron Brightwell, Rolf Riesen, Patrick G. Bridges, and Dorian C. Arnold. 2011. “Evaluating the Viability of Process Replication Reliability for Exascale Systems.” In Conference on High Performance Computing Networking, Storage and Analysis, SC 2011, Seattle, WA, USA, November 12-18, 2011, 44:1–44:12. doi:10.1145/2063384.2063443.
    @inproceedings{FerreiraEtAl2011,
      author = {Ferreira, Kurt B. and Stearley, Jon and Laros, III, James H. and Oldfield, Ron and Pedretti, Kevin T. and Brightwell, Ron and Riesen, Rolf and Bridges, Patrick G. and Arnold, Dorian C.},
      title = {Evaluating the viability of process replication reliability for exascale
                     systems},
      booktitle = {Conference on High Performance Computing Networking, Storage and Analysis,
                     {SC} 2011, Seattle, WA, USA, November 12-18, 2011},
      pages = {44:1--44:12},
      year = {2011},
      crossref = {DBLP:conf/sc/2011},
      url = {http://doi.acm.org/10.1145/2063384.2063443},
      doi = {10.1145/2063384.2063443},
      timestamp = {Tue, 30 Jun 2015 16:34:04 +0200},
      biburl = {http://dblp.uni-trier.de/rec/bib/conf/sc/FerreiraSLOPBRBA11},
      bibsource = {dblp computer science bibliography, http://dblp.org}
    }