New Techniques to Design Silent Data Corruption Detectors

Research topic and goals

With the increase in the number of components and the power constraints of next-generation supercomputers, reliability is one of the major concerns that must be addressed to reach exascale. In addition to fail-stop failures, which crash applications, silent data corruption (SDC) is considered one of the most dangerous types of error, because it can make applications produce wrong results without any notice. SDCs are still not well understood: their frequency and occurrence patterns remain mostly unknown. Detecting SDCs is challenging because of their silent nature. Our objective is to develop novel software-level strategies that can detect most SDCs occurring in future supercomputers.

Results for 2015/2016

During 2015/2016 we tackled several challenges related to silent data corruption (Bautista-Gomez and Cappello 2015) and tested preliminary approaches (Di, Berrocal, and Cappello 2015). We developed a lightweight, adaptive, impact-driven SDC detector (Di and Cappello 2016), with three contributions. (1) We carefully characterized 18 HPC applications and benchmarks, discussing their runtime data features as well as the impact of SDCs on their execution results. (2) We proposed an impact-driven detection model that does not blindly maximize prediction accuracy but instead detects only influential SDCs, so as to guarantee user-acceptable execution results. (3) Our solution adapts to dynamic prediction errors based on local runtime data and automatically tunes its detection ranges to keep the false alarm rate low. Another question we answered concerns the use of multiple SDC detectors with different characteristics. We designed a strategy that lets users decide which SDC detector to use and at which frequency, choosing among several detectors with different performance, recall, and precision. This research was published as a full paper at HiPC 2015 (Bautista-Gomez et al. 2015), and an extended version with further details and new results was published as a journal article (Bautista-Gomez et al. 2016). In addition, we investigated how to build SDC detectors based on support vector machines (SVM) and compared them with state-of-the-art SDC detectors. Our evaluation with multiple large-scale HPC applications shows that SVM can learn the behavior of the datasets and detect the vast majority of anomalies while imposing negligible overhead. Our technique performs better than existing ones in most cases. We published these results at CCGrid 2016 (Subasi et al. 2016).
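To make the idea of an adaptive, impact-driven detection range more concrete, the following is a minimal sketch and not the published detector. It assumes a single data point observed over time, a simple linear extrapolation as the predictor, and two hypothetical parameters: `impact_bound` (the largest deviation the user considers harmless) and `slack` (how much the range widens with observed prediction errors). A value is flagged only when it falls outside the range, so deviations too small to matter are deliberately ignored.

```python
from typing import List, Sequence


def detect_sdc(values: Sequence[float],
               impact_bound: float,
               slack: float = 2.0) -> List[int]:
    """Return the time steps whose value is suspected to be corrupted.

    Illustrative assumptions (not taken from the papers):
    - the first two values are a clean warm-up,
    - the predictor is a linear extrapolation of the last two accepted values,
    - the detection radius is slack * (largest accepted error) + impact_bound.
    """
    history = list(values[:2])   # accepted (trusted) values so far
    max_error = 0.0              # largest prediction error on accepted steps
    flagged: List[int] = []

    for t in range(2, len(values)):
        predicted = 2.0 * history[-1] - history[-2]
        error = abs(values[t] - predicted)

        # The radius adapts to how well the predictor has been doing locally.
        radius = slack * max_error + impact_bound

        if error > radius:
            flagged.append(t)
            # Keep the prediction instead of the suspicious value so that
            # the corruption does not pollute later predictions.
            history.append(predicted)
        else:
            max_error = max(max_error, error)
            history.append(values[t])

    return flagged
```

For example, `detect_sdc([0.0, 1.0, 2.0, 103.0, 4.0, 5.0], impact_bound=0.5)` flags step 3, whereas a small deviation such as 3.2 at the same step stays inside the range and is not reported.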

Results for 2016/2017

During 2016/2017 we implemented a new SDC detection algorithm that dynamically combines multiple regression and detection mechanisms in order to better adapt to the conditions of the execution. Because data entropy changes over time and space, for instance as turbulent regions move through the domain, dynamic SDC detection techniques are a much more robust way to adapt to such changes. Our evaluation shows that the proposed technique achieves a lower false positive rate and similar recall, while imposing a much lower memory overhead than state-of-the-art techniques. This work has been submitted to an international conference and is currently under review. The software prototype, called MACORD, is pending approval for open source publication. In addition, several separate efforts have been made to test selective replication for dealing with SDC in HPC systems (Subasi et al. 2017), in particular for task-based programming languages (Subasi et al. 2016).
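As an illustration of what choosing among several prediction mechanisms at run time can look like, here is a small sketch. It is not MACORD, whose design is not described here. The three candidate predictors, the selection rule (pick the one that would have best predicted the latest accepted value), and the fixed `bound` are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Predictor = Callable[[Sequence[float]], float]

# Hypothetical candidate predictors: constant, linear and quadratic extrapolation.
PREDICTORS: Dict[str, Predictor] = {
    "last_value": lambda h: h[-1],
    "linear": lambda h: 2.0 * h[-1] - h[-2],
    "quadratic": lambda h: 3.0 * h[-1] - 3.0 * h[-2] + h[-3],
}


def pick_predictor(history: Sequence[float]) -> str:
    """Name of the predictor that best predicted the latest accepted value.

    Requires at least four accepted values (three inputs plus one target).
    Ties are resolved in favour of the simpler predictor (dictionary order).
    """
    past, target = history[:-1], history[-1]
    return min(PREDICTORS, key=lambda name: abs(PREDICTORS[name](past) - target))


def detect_dynamic(values: Sequence[float], bound: float) -> List[Tuple[int, str]]:
    """Flag suspicious steps, re-selecting the predictor at every step."""
    history = list(values[:4])          # assumed clean warm-up
    flagged: List[Tuple[int, str]] = []

    for t in range(4, len(values)):
        name = pick_predictor(history)
        predicted = PREDICTORS[name](history)
        if abs(values[t] - predicted) > bound:
            flagged.append((t, name))
            history.append(predicted)   # do not trust the suspicious value
        else:
            history.append(values[t])

    return flagged
```

On smooth data the cheap predictors tend to win, while on rapidly changing data the higher-order one takes over. A real detector would also adapt the detection range, which this sketch keeps fixed.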

Results for 2017/2018

A comprehensive study of the capabilities of support vector machines in detecting silent data corruption was published in the journal Sustainable Computing: Informatics and Systems (Subasi et al. 2018). It extends and aggregates several research efforts carried out within this project. After these results, most of the partners started working on different topics, some related to resilience and others in other domains. Some of these new interests led to new collaboration projects within the JLESC. It was therefore decided to suspend the project and close it after six months. However, owing to renewed interest in the project and fresh ideas, the project was restarted. Currently, BSC and RIKEN are investigating energy-free alternatives to SDC quantification in large-scale supercomputers.

Results for 2018/2019

We developed a neural-network-based detector (Wang et al. 2018).
Compared with state-of-the-art generic SDC detectors, this detector can detect SDCs multiple iterations after they were injected. We evaluated the detector with six FLASH applications and two Mantevo mini-apps. Experiments show that it detects more than 89% of SDCs with a false positive rate of less than 2%.

Results for 2019/2020

Most of the partners of this project have started working on other projects, and there has been no recent work on this topic. Therefore, we are suspending the project and will potentially mark it as finished in six months if no further comments or requests are raised by any of the partners.

Visits and meetings

  • August 2nd, 2015 - November 6th, 2015: Omer's internship at ANL
  • July 4th, 2016 - July 8th, 2016: Franck's visit to BSC
  • 2018: multiple visits by Franck to UIUC
  • 2019: Franck's visit to BSC

Impact and publications

  1. Wang, Chen, Nikoli Dryden, Franck Cappello, and Marc Snir. 2018. “Neural Network Based Silent Error Detector.” In IEEE International Conference on Cluster Computing, CLUSTER 2018, Belfast, UK, September 10-13, 2018, 168–78.
    @inproceedings{clusterWangDCS18,
      author = {Wang, Chen and Dryden, Nikoli and Cappello, Franck and Snir, Marc},
      title = {Neural Network Based Silent Error Detector},
      booktitle = {{IEEE} International Conference on Cluster Computing, {CLUSTER} 2018,
                     Belfast, UK, September 10-13, 2018},
      pages = {168--178},
      year = {2018}
    }
    
  2. Subasi, Omer, Sheng Di, Leonardo Bautista-Gomez, Prasanna Balaprakash, Osman Unsal, Jesus Labarta, Adrian Cristal, Sriram Krishnamoorthy, and Franck Cappello. 2018. “Exploring the Capabilities of Support Vector Machines in Detecting Silent Data Corruptions.” Sustainable Computing: Informatics and Systems. doi:https://doi.org/10.1016/j.suscom.2018.01.004.
    @article{Subasi2018,
      title = {Exploring the Capabilities of Support Vector Machines in Detecting Silent Data Corruptions},
      journal = {Sustainable Computing: Informatics and Systems},
      year = {2018},
      issn = {2210-5379},
      doi = {10.1016/j.suscom.2018.01.004},
      url = {https://www.sciencedirect.com/science/article/pii/S2210537917300896},
      author = {Subasi, Omer and Di, Sheng and Bautista-Gomez, Leonardo and Balaprakash, Prasanna and Unsal, Osman and Labarta, Jesus and Cristal, Adrian and Krishnamoorthy, Sriram and Cappello, Franck},
      keywords = {HPC Applications}
    }
    
  3. Subasi, Omer, Gulay Yalcin, Ferad Zyulkyarov, Osman Unsal, and Jesus Labarta. 2017. “Designing and Modelling Selective Replication for Fault-Tolerant HPC Applications.” In 2017 IEEE International Conference on Cluster Cloud and Grid Computing (CCGrid’17). IEEE.
    @inproceedings{subasi2017rep,
      title = {Designing and Modelling Selective Replication for Fault-tolerant HPC Applications},
      author = {Subasi, Omer and Yalcin, Gulay and Zyulkyarov, Ferad and Unsal, Osman and Labarta, Jesus},
      booktitle = {2017 IEEE International Conference on Cluster Cloud and Grid Computing (CCGrid'17)},
      year = {2017},
      organization = {IEEE}
    }
    
  4. Di, Sheng, and Franck Cappello. 2016. “Adaptive-Impact Driven Detection of Silent Data Corruption for HPC Applications.” IEEE Transactions on Parallel and Distributed Systems. Phoenix, United States.
    @article{ShengEtCappello2016,
      address = {Phoenix, United States},
      author = {Di, Sheng and Cappello, Franck},
      journal = {IEEE Transactions on Parallel and Distributed Systems},
      title = {Adaptive-Impact Driven Detection of Silent Data Corruption for HPC Applications},
      year = {2016}
    }
    
  5. Subasi, Omer, Sheng Di, Leonardo Bautista-Gomez, Prasanna Balaprakash, Osman Unsal, Jesus Labarta, Adrian Cristal, and Franck Cappello. 2016. “Spatial Support Vector Regression to Detect Silent Errors in the Exascale Era.” In Proceedings of the 2016 IEEE/ACM International Symposium on Cluster Cloud and Grid Computing. IEEE.
    @inproceedings{SubasiEtAl2016,
      title = {Spatial Support Vector Regression to Detect Silent Errors in the Exascale Era},
      author = {Subasi, Omer and Di, Sheng and Bautista-Gomez, Leonardo and Balaprakash, Prasanna and Unsal, Osman and Labarta, Jesus and Cristal, Adrian and Cappello, Franck},
      booktitle = {Proceedings of the 2016 IEEE/ACM International Symposium on Cluster Cloud and
            Grid Computing},
      organization = {IEEE},
      year = {2016}
    }
    
  6. Bautista-Gomez, Leonardo, Anne Benoit, Aurélien Cavelan, Saurabh K Raina, Yves Robert, and Hongyang Sun. 2016. “Coping with Recall and Precision of Soft Error Detectors.” Journal of Parallel and Distributed Computing 98. Elsevier: 8–24.
    @article{bautista2016coping,
      title = {Coping with recall and precision of soft error detectors},
      author = {Bautista-Gomez, Leonardo and Benoit, Anne and Cavelan, Aur{\'e}lien and Raina, Saurabh K and Robert, Yves and Sun, Hongyang},
      journal = {Journal of Parallel and Distributed Computing},
      volume = {98},
      pages = {8--24},
      year = {2016},
      publisher = {Elsevier}
    }
    
  7. Subasi, Omer, Gulay Yalcin, Ferad Zyulkyarov, Osman Unsal, and Jesus Labarta. 2016. “A Runtime Heuristic to Selectively Replicate Tasks for Application-Specific Reliability Targets.” In 2016 IEEE International Conference on Cluster Computing (CLUSTER’16), 498–505. IEEE.
    @inproceedings{subasi2016run,
      title = {A runtime heuristic to selectively replicate tasks for application-specific reliability targets},
      author = {Subasi, Omer and Yalcin, Gulay and Zyulkyarov, Ferad and Unsal, Osman and Labarta, Jesus},
      booktitle = {2016 IEEE International Conference on Cluster Computing (CLUSTER'16)},
      pages = {498--505},
      year = {2016},
      organization = {IEEE}
    }
    
  8. Bautista-Gomez, Leonardo, Anne Benoit, Aurélien Cavelan, Saurabh K Raina, Yves Robert, and Hongyang Sun. 2015. “Which Verification for Soft Error Detection?” In Proceedings of the 24th International Conference on High-Performance Computing. IEEE.
    @inproceedings{BautEtAl2015b,
      title = {Which Verification for Soft Error Detection?},
      author = {Bautista-Gomez, Leonardo and Benoit, Anne and Cavelan, Aur{\'e}lien and Raina, Saurabh K and Robert, Yves and Sun, Hongyang},
      year = {2015},
      booktitle = {Proceedings of the 24th International Conference on High-Performance Computing},
      organization = {IEEE}
    }
    
  9. Bautista-Gomez, Leonardo Arturo, and Franck Cappello. 2015. “Detecting and Correcting Data Corruption in Stencil Applications through Multivariate Interpolation.” In Proceedings of the 2015 IEEE International Conference on Cluster Computing, 595–602. IEEE Computer Society.
    @inproceedings{BautEtAl2015,
      title = {Detecting and correcting data corruption in stencil applications through multivariate interpolation},
      author = {Bautista-Gomez, Leonardo Arturo and Cappello, Franck},
      booktitle = {Proceedings of the 2015 IEEE International Conference on Cluster Computing},
      pages = {595--602},
      year = {2015},
      organization = {IEEE Computer Society}
    }
    
  10. Di, Sheng, Eduardo Berrocal, and Franck Cappello. 2015. “An Efficient Silent Data Corruption Detection Method with Error-Feedback Control and Even Sampling for HPC Applications.” In 2015 IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid’15), 271–80. IEEE.
    @inproceedings{di2015detect,
      title = {An efficient silent data corruption detection method with error-feedback control and even sampling for HPC applications},
      author = {Di, Sheng and Berrocal, Eduardo and Cappello, Franck},
      booktitle = {2015 IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid'15)},
      pages = {271--280},
      year = {2015},
      organization = {IEEE}
    }
    

Future plans

No future plans.