Optimization of Fault-Tolerance Strategies for Workflow Applications

Research topic and goals

In this project, we aim at finding efficient fault-tolerant scheduling schemes for workflow applications that can be expressed as a directed acyclic graph (DAG) of tasks.

Checkpointing-recovery is the traditional fault-tolerance technique when it comes to resilience for large-scale platforms. Unfortunately, as platform scale increases, the platform Mean Time Between Failures (MTBF) decreases, and checkpoints must become more frequent to keep pace. As such, checkpoint-recovery is expected to become a major bottleneck for applications running on post-petascale platforms.
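
As a back-of-the-envelope illustration of this trend (the node MTBF and checkpoint cost below are illustrative assumptions, not measurements from the project), the classical Young/Daly first-order approximation of the optimal checkpointing period shrinks with the square root of the platform MTBF, which itself decreases linearly with the number of nodes:

    from math import sqrt

    # Young/Daly first-order approximation: W_opt ~ sqrt(2 * C * M), with C the
    # checkpoint cost and M the platform MTBF. With N nodes of individual MTBF mu,
    # M = mu / N, so checkpoints must become more frequent as the platform grows.
    def young_daly_period(checkpoint_cost_s, platform_mtbf_s):
        return sqrt(2.0 * checkpoint_cost_s * platform_mtbf_s)

    node_mtbf = 10 * 365 * 86400.0   # assumed per-node MTBF: 10 years
    checkpoint_cost = 600.0          # assumed checkpoint cost: 10 minutes

    for nodes in (10_000, 100_000, 1_000_000):
        mtbf = node_mtbf / nodes
        print(f"{nodes:>9} nodes: platform MTBF = {mtbf / 3600:7.2f} h, "
              f"optimal period = {young_daly_period(checkpoint_cost, mtbf) / 3600:5.2f} h")

At one million nodes in this toy setting, the platform MTBF drops to about five minutes, below the checkpoint cost itself: this is precisely the regime where checkpoint-recovery alone breaks down.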

We first focus on replication as a way of mitigating the checkpointing-recovery overhead. A task can be checkpointed and/or replicated, so that if a single replica fails, no recovery is needed. Our goal is to decide which tasks to checkpoint, which tasks to replicate, and how many resources to allocate to each task for the execution of general workflow applications. To do so, we first need to derive a clear model of replication, as there are many ways to implement it, even for a single task.

Results for 2016/2017

The initial work for this project has focused on using replication as a detection and correction mechanism for Silent Data Corruptions (SDC). Although other detection techniques exist for HPC applications, based on algorithm-based fault tolerance (ABFT), invariant preservation, or data analytics, replication remains the most transparent and least intrusive technique.

In this project, replication is combined with checkpointing to enable rollback and recovery when forward recovery is not possible, which occurs when too many replicas are corrupted. The goal is to find the right level of replication (duplication, triplication or more) needed to efficiently detect and correct silent errors at scale. We have provided a detailed analytical study for this framework.
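
A back-of-the-envelope calculation (not a result from the project, and assuming replica corruptions are independent and never coincide) illustrates why triplication enables forward recovery where duplication does not: with two replicas, any corruption forces a rollback, whereas with three replicas a rollback is needed only when at least two replicas are corrupted.

    # p: probability that a given replica is hit by a silent error within one period.
    def rollback_probability(p, replicas):
        ok = (1 - p) ** replicas                              # no replica corrupted
        if replicas >= 3:
            ok += replicas * p * (1 - p) ** (replicas - 1)    # one corrupted: majority vote corrects it
        return 1 - ok

    for p in (0.01, 0.05, 0.10):
        print(f"p = {p:.2f}: duplication {rollback_probability(p, 2):.4f}, "
              f"triplication {rollback_probability(p, 3):.4f}")

The analytical study addresses precisely this trade-off between the extra resource cost of triplication and its reduced rollback probability.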

Results for 2017/2018

We have extended these results to platforms subject to both silent and fail-stop errors. Fail-stop errors are immediately detected, unlike silent errors, and replication may also help tolerate such errors.

We have considered two flavors of replication: process replication and group replication. Process replication applies to message-passing applications with communicating processes: each process is replicated, and the platform is composed of process pairs or triplets. Group replication applies to black-box applications, whose parallel execution is replicated several times; the platform is partitioned into two halves (or three thirds). In both scenarios, results are compared before each checkpoint, which is taken only when both results (duplication) or two out of three results (triplication) coincide. Otherwise, one or more silent errors have been detected, and the application rolls back to the last checkpoint; a rollback is also triggered whenever a fail-stop error strikes.
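
The checkpoint-time comparison described above can be sketched as a simple majority vote (a hedged illustration; the function name and the byte-level comparison of results are assumptions, not the project's implementation):

    from collections import Counter

    def compare_replicas(results):
        """Return the agreed-upon result, or None if a rollback is required."""
        value, occurrences = Counter(results).most_common(1)[0]
        if len(results) == 2:
            return value if occurrences == 2 else None    # duplication: detection only
        return value if occurrences >= 2 else None        # triplication: detect and correct

    print(compare_replicas([b"v1", b"v1", b"corrupted"]))  # b'v1': the outlier is outvoted
    print(compare_replicas([b"v1", b"corrupted"]))         # None: silent error detected, roll back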

We provide a detailed analytical study of all these scenarios, with formulas to decide, for each scenario, the optimal parameters as a function of the error rate, checkpoint cost, and platform size. We also report an extensive set of simulation results that corroborate the analytical model.
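
As a purely generic illustration of the shape such formulas take (the exact expressions of the study depend on the replication scenario and are not reproduced here), a first-order minimization of the waste with verified checkpoints yields a Young/Daly-type square-root formula:

    \[
      \mathrm{Waste}(T) \;\approx\; \frac{V + C}{T} + \lambda_f \frac{T}{2} + \lambda_s T
      \quad\Longrightarrow\quad
      T^{\mathrm{opt}} = \sqrt{\frac{V + C}{\lambda_f / 2 + \lambda_s}},
    \]

    where $C$ is the checkpoint cost, $V$ the cost of comparing replica results, and
    $\lambda_f$ and $\lambda_s$ the fail-stop and silent error rates, respectively.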

While the previous work focused on applications for which the checkpointing frequency can be freely chosen, so that the aim was to find the optimal checkpointing period, we have also initiated the study of linear chains of parallel tasks. The aim is then to decide which tasks to checkpoint and/or replicate. For this case, we have provided an optimal dynamic programming algorithm, together with an extensive set of simulations to assess (i) in which scenarios checkpointing performs better than replication, or vice-versa; and (ii) in which scenarios the combination of both methods is useful, and to what extent.
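
The flavor of this dynamic program can be conveyed by the simplified sketch below, which only decides after which tasks to place a checkpoint (the actual algorithm also decides, per segment, whether to replicate). The failure rate, checkpoint and recovery costs, and the classical expected-execution-time formula under exponentially distributed fail-stop errors are illustrative assumptions:

    from math import exp, inf

    LAMBDA = 1e-5      # assumed fail-stop error rate (per second)
    C, R = 60.0, 30.0  # assumed checkpoint and recovery costs (seconds)

    def expected_segment_time(work):
        """Expected time to execute `work` seconds of computation plus a checkpoint,
        restarting from the previous checkpoint after each failure."""
        return exp(LAMBDA * R) * (exp(LAMBDA * (work + C)) - 1.0) / LAMBDA

    def best_chain_time(weights):
        n = len(weights)
        prefix = [0.0] * (n + 1)
        for i, w in enumerate(weights):
            prefix[i + 1] = prefix[i] + w
        dp = [inf] * (n + 1)    # dp[i]: optimal expected time for tasks 1..i, ending with a checkpoint
        dp[0] = 0.0
        for i in range(1, n + 1):
            for j in range(i):  # previous checkpoint placed right after task j
                dp[i] = min(dp[i], dp[j] + expected_segment_time(prefix[i] - prefix[j]))
        return dp[n]

    print(best_chain_time([3600.0, 1800.0, 7200.0, 900.0]))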

Results for 2018/2019

In high-performance computing environments, input/output (I/O) from various sources often contends for scarce available bandwidth. Adding to the I/O operations inherent to the failure-free execution of an application, I/O from checkpoint/restart (CR) operations (used to ensure progress in the presence of failures) places an additional burden, as it increases I/O contention and leads to degraded performance. We have considered a cooperative scheduling policy that optimizes the overall performance of concurrently executing CR-based applications sharing valuable I/O resources. First, we provide a theoretical model and then derive a set of necessary constraints to minimize the global waste on the platform. Our results demonstrate that the optimal checkpoint interval as defined by Young/Daly, despite providing a sensible metric for a single application, is not sufficient to optimally address resource contention at the platform scale. We therefore show that combining optimal checkpointing periods with I/O scheduling strategies can significantly improve overall application performance, thereby maximizing platform throughput. Overall, these results provide critical analysis and direct guidance on checkpointing large-scale workloads in the presence of competing I/O while minimizing the impact on application performance. This work received the best paper award at APDCM 2018, a workshop run in conjunction with IPDPS 2018.
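
The core of the contention argument can be illustrated with a small, purely hypothetical calculation: when several applications write their checkpoints concurrently over shared I/O bandwidth, each write takes roughly as many times longer as there are concurrent writers, so the checkpoint cost for which each application's Young/Daly period was tuned in isolation no longer holds.

    from math import sqrt

    def young_daly(ckpt_cost, mtbf):
        return sqrt(2.0 * ckpt_cost * mtbf)

    mtbf = 6 * 3600.0   # assumed per-application MTBF: 6 hours
    solo_ckpt = 120.0   # assumed checkpoint cost with exclusive access to the I/O system

    for k in (1, 2, 4, 8):                  # concurrent checkpointing applications
        contended_ckpt = k * solo_ckpt      # shared bandwidth stretches each write
        print(f"{k} writers: period chosen in isolation {young_daly(solo_ckpt, mtbf):6.0f} s, "
              f"contention-aware period {young_daly(contended_ckpt, mtbf):6.0f} s")

This is only meant to convey why a platform-level, cooperative I/O schedule outperforms independently tuned checkpointing periods; the paper's model and constraints are more detailed.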

Three JLESC partners, Inria, Riken and UTK, have conducted a study to compare the performance of different approaches to tolerating failures using checkpoint/restart when executed on large-scale failure-prone platforms. We study (i) rigid applications, which use a constant number of processors throughout execution; (ii) moldable applications, which can use a different number of processors after each restart following a fail-stop error; and (iii) grid applications, which are moldable applications restricted to rectangular processor grids (such as many dense linear algebra kernels). For each application type, we compute the optimal number of failures to tolerate before relinquishing the current allocation and waiting until a new resource can be allocated, and we determine the optimal yield that can be achieved. We instantiate our performance model with a realistic application scenario and make it publicly available for further use.
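
The distinction between the three application classes can be conveyed by the toy sketch below (the allocation size, requested node count and processor-grid constraint are made up, and the study's performance model additionally accounts for checkpoint, rework and redistribution costs, which are ignored here):

    def usable_nodes(kind, allocated, requested, failures):
        alive = allocated - failures
        if kind == "rigid":
            # keeps exactly `requested` nodes, drawing on spare nodes until they run out
            return requested if alive >= requested else 0
        if kind == "moldable":
            # restarts on whatever number of nodes survives
            return alive
        if kind == "grid":
            # restarts on the largest p x q rectangle (p, q >= 2) fitting in the survivors
            best = 0
            for p in range(2, alive + 1):
                q = alive // p
                if q >= 2:
                    best = max(best, p * q)
            return best
        raise ValueError(kind)

    for f in (0, 2, 5, 8):
        print(f, {k: usable_nodes(k, 100, 95, f) for k in ("rigid", "moldable", "grid")})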

Visits and meetings

Aurélien Cavelan (Inria) visited Franck Cappello (ANL) in Chicago for three months (March, April, and May 2016) to initiate the project. Furthermore, we have met regularly over the past years. In particular, we attended the SC conference (November 2016 and November 2017), where we had extensive discussions to make progress, and we represented the JLESC at the Inria booth during these conferences.

When not meeting in person, we have stayed in close collaboration through regular Skype meetings, which allowed us to make progress on the project.

Yves Robert (Inria) made several visits to Univ. Tenn. Knoxville in 2018/2019, for a total of approximately two months.

Valentin Le Fèvre visited Univ. Tenn. Knoxville for ten days in February 2019.

Impact and publications

Two papers have been published in FTXS’17 (Benoit et al. 2017), (Benoit et al. 2017).

The work combining fail-stop and silent errors has been published in JPDC (Benoit et al. 2018).

A work on executing workflows on high-bandwidth memory architectures was published in ICPP’18 (Benoit et al. 2018).

The work on optimal cooperative checkpointing for shared high-performance computing platforms was the best paper at APDCM’18 (Hérault et al. 2018).

Finally, the work studying whether moldable applications perform better on failure-prone HPC platforms was published in Resilience’18 (Le Fèvre et al. 2018).

  1. Le Fèvre, Valentin, George Bosilca, Aurelien Bouteiller, Thomas Herault, Atsushi Hori, Yves Robert, and Jack Dongarra. 2018. “Do Moldable Applications Perform Better on Failure-Prone HPC Platforms?” In Resilience: 11th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids, Jointly Published with Euro-Par 2018. LNCS. Springer Verlag.
    @inproceedings{LeFevreEtAl2018,
      author = {Le Fèvre, Valentin and Bosilca, George and Bouteiller, Aurelien and Herault, Thomas and Hori, Atsushi and Robert, Yves and Dongarra, Jack},
      booktitle = {{Resilience}: 11th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids, jointly published with {Euro-Par 2018}},
      title = {{Do moldable applications perform better on failure-prone HPC platforms?}},
      year = {2018},
      series = {LNCS},
      publisher = {Springer Verlag}
    }
    
  2. Hérault, Thomas, Yves Robert, Aurélien Bouteiller, Dorian Arnold, Kurt B. Ferreira, George Bosilca, and Jack Dongarra. 2018. “Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms.” In 20th Workshop on Advances in Parallel and Distributed Computational Models APDCM 2018. IEEE Computer Society Press.
    @inproceedings{HeraultEtAl2018,
      author = {Hérault, Thomas and Robert, Yves and Bouteiller, Aurélien and Arnold, Dorian and Ferreira, Kurt B. and Bosilca, George and Dongarra, Jack},
      booktitle = {20th Workshop on Advances in Parallel and Distributed Computational Models {APDCM 2018}},
      publisher = {IEEE Computer Society Press},
      title = {Optimal cooperative checkpointing for shared high-performance computing platforms},
      year = {2018}
    }
    
  3. Benoit, Anne, Swann Perarnau, Loïc Pottier, and Yves Robert. 2018. “A Performance Model to Execute Workflows on High-Bandwidth Memory Architectures.” In ICPP’2018, the 47th Int. Conf. on Parallel Processing. IEEE Computer Society Press.
    @inproceedings{BenoitEtAl2018b,
      author = {Benoit, Anne and Perarnau, Swann and Pottier, Loïc and Robert, Yves},
      booktitle = {{ICPP'2018}, the 47th Int. Conf. on Parallel Processing},
      title = {A performance model to execute workflows on high-bandwidth memory architectures},
      publisher = {{IEEE} Computer Society Press},
      year = {2018}
    }
    
  4. Benoit, Anne, Aurélien Cavelan, Franck Cappello, Padma Raghavan, Yves Robert, and Hongyang Sun. 2018. “Coping with Silent and Fail-Stop Errors at Scale by Combining Replication and Checkpointing.” J. Parallel and Distributed Computing.
    @article{BenoitEtAl2018,
      author = {Benoit, Anne and Cavelan, Aurélien and Cappello, Franck and Raghavan, Padma and Robert, Yves and Sun, Hongyang},
      journal = {J. Parallel and Distributed Computing},
      title = {Coping with silent and fail-stop errors at scale by combining replication and checkpointing},
      year = {2018},
    }
    
  5. Benoit, Anne, Aurélien Cavelan, Franck Cappello, Padma Raghavan, Yves Robert, and Hongyang Sun. 2017. “Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale.” In Proceedings of the 7th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS).
    @inproceedings{benoitEtAl2017identifying,
      title = {Identifying the right replication level to detect and correct silent errors at scale},
      author = {Benoit, Anne and Cavelan, Aur{\'e}lien and Cappello, Franck and Raghavan, Padma and Robert, Yves and Sun, Hongyang},
      year = {2017},
      booktitle = {Proceedings of the 7th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS)},
    }
    
  6. Benoit, Anne, Aurélien Cavelan, Valentin Le Fèvre, and Yves Robert. 2017. “Optimal Checkpointing Period with Replicated Execution on Heterogeneous Platforms.” In Proceedings of the 7th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS).
    @inproceedings{benoitEtAl2017optimal,
      title = {Optimal checkpointing period with replicated execution on heterogeneous platforms},
      author = {Benoit, Anne and Cavelan, Aur{\'e}lien and Le F{\`e}vre, Valentin and Robert, Yves},
      booktitle = {Proceedings of the 7th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS)},
      year = {2017},
    }
    

Future plans

There remains a lot to explore for workflow applications consisting of tasks. We have so far focused only on duplication in this setting, but one may want to assign different replication levels (duplication, triplication, or more) to different tasks, depending upon their criticality in terms of longest paths, number of successors, etc. This is even more relevant when considering a general directed acyclic graph of tasks rather than restricting to linear chains. This topic is called partial replication; even though it has been empirically studied in previous work, designing an optimal strategy that combines partial redundancy with checkpointing, and analyzing its efficacy, remain to be done.
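
As a sketch of what such a criticality-driven partial replication heuristic could look like (the DAG, task weights, and replication thresholds below are purely illustrative, and no such heuristic is prescribed by the project), one can rank tasks by their bottom level, i.e., the length of the longest path down to an exit task, breaking ties by their number of successors:

    from collections import defaultdict
    from functools import cache

    weights = {"a": 5, "b": 3, "c": 8, "d": 2, "e": 4}          # illustrative task weights
    edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("c", "e")]

    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)

    @cache
    def bottom_level(task):
        """Weight of the task plus the longest path down to an exit task."""
        return weights[task] + max((bottom_level(s) for s in succ[task]), default=0)

    ranked = sorted(weights, key=lambda t: (bottom_level(t), len(succ[t])), reverse=True)
    # Hypothetical policy: triplicate the most critical task, duplicate the next two.
    replication = {t: 3 if i == 0 else 2 if i < 3 else 1 for i, t in enumerate(ranked)}
    print(replication)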

Finally, our initial goal was to target pipelined workflow applications, where data continuously enters the workflow, and where the objective is to maximize the throughput that can be achieved. This causes several new challenges that we hope to address in the future.
