Exploiting Active Storage for Resilience

Research topic and goals

The research in this topic is based on a software and a hardware architecture, both of which are currently under development: GVR and BGAS. GVR (Global View Resilience) is a user-level library that enables portable, efficient, application-controlled resilience (Chien et al. 2015). It focusses on scalability and on maximizing error recovery. BGAS (Blue Gene Active Storage) is a realisation of an active storage architecture based on custom flash memory cards that are integrated into Blue Gene/Q I/O drawers. Here JSC continues previous work on the integration of non-volatile memory (Sayed et al. 2013). In this subproject, our goal is to explore the opportunities offered by both architectures by integrating them. More specifically, the following research questions are addressed:

  • How well can the software architecture of GVR exploit the BGAS hardware architecture?
  • How efficiently can both architectures be exploited?
  • What is the value of active storage, and for which classes of large-scale scientific computing?

Results for 2015/2016

GVR has been successfully installed by the ANL team on Jülich’s BG/Q system JUQUEEN, exploiting the attached BGAS nodes. This setup was the basis for an extensive performance analysis, the results of which will be published at ISC16.

The conclusion was that the NVM-based BGAS system provides a more efficient basis and more opportunities for GVR versioning than a traditional external storage system attached to the same machine, especially for flexible error recovery using random version access. If the I/O node (ION) is equipped with additional compute resources, e.g., idle cores, in-situ analysis could be off-loaded to the ION. Such active storage concepts can potentially be exploited to enable algorithm-based fault-tolerance (ABFT) error checking. Further performance improvements might be attainable by using the Direct Storage Access (DSA) interface instead of the local file system that was used within this project.
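To make the terms versioning, random version access and ABFT error checking more concrete, the following Python sketch models them in a few lines. It is a toy, in-memory illustration under stated assumptions: the class `VersionedArray` and the functions `matvec` and `abft_check` are hypothetical names introduced here. They are not part of the GVR API or the BGAS software stack, and the checksum test for a matrix-vector product is only one common example of an ABFT check, not necessarily the one considered in the project.

```python
class VersionedArray:
    """Toy in-memory stand-in for a versioned array (hypothetical, not the GVR API)."""

    def __init__(self, data):
        self.data = list(data)
        self._versions = []  # snapshot i is stored at index i

    def commit(self):
        """Store a snapshot of the current contents; return its version number."""
        self._versions.append(list(self.data))
        return len(self._versions) - 1

    def restore(self, version):
        """Random version access: roll back to any stored version."""
        self.data = list(self._versions[version])


def matvec(a, x):
    """Dense matrix-vector product y = A x on nested lists."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def abft_check(a, x, y, tol=1e-9):
    """Checksum test: sum(y) must equal (column sums of A) dot x."""
    col_sums = [sum(col) for col in zip(*a)]
    expected = sum(c * x_j for c, x_j in zip(col_sums, x))
    return abs(sum(y) - expected) <= tol


if __name__ == "__main__":
    a = [[2.0, 0.0], [1.0, 3.0]]
    state = VersionedArray([1.0, 1.0])
    v0 = state.commit()          # version 0 holds the clean input

    x = list(state.data)
    state.data = matvec(a, x)    # one computation step
    state.data[1] += 5.0         # simulate a silent error in the result

    if not abft_check(a, x, state.data):
        state.restore(v0)        # recover from a chosen earlier version
    print(state.data)
```

In an actual deployment the snapshots would reside on a storage tier rather than in a Python list, and the check could run on a different resource than the solver; both aspects are outside the scope of this sketch.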

With the presentation of the paper, this project will come to an end.

Visits and meetings

Beyond regular contact via email, the following meetings involving most of the participants took place:

  • Meeting of Andrew A. Chien (ANL), Nan Dun (ANL) and Dirk Pleiter (JSC) at SC14 on November 17, 2014.
  • Technical update meeting on February 5, 2015.
  • Regular technical meetings thereafter until December 2015.

Impact and publications

A paper submitted by the project to ISC16 has been accepted.

Future plans

None

References

    1. Chien, A., P. Balaji, P. Beckman, N. Dun, A. Fang, H. Fujita, K. Iskra, et al. 2015. “Versioned Distributed Arrays for Resilience in Scientific Applications: Global View Resilience.” Procedia Computer Science 51: 29–38. doi:10.1016/j.procs.2015.05.187.
      @article{ChienEtAl2015,
        author = {Chien, A. and Balaji, P. and Beckman, P. and Dun, N. and Fang, A. and Fujita, H. and Iskra, K. and Rubenstein, Z. and Zheng, Z. and Schreiber, R. and Hammond, J. and Dinan, J. and Laguna, I. and Richards, D. and Dubey, A. and van Straalen, B. and Hoemmen, M. and Heroux, M. and Teranishi, K. and Siegel, A.},
        doi = {10.1016/j.procs.2015.05.187},
        journal = {Procedia Computer Science},
        note = {International Conference On Computational Science,
            {ICCS} 2015 Computational Science at the Gates of Nature},
        pages = {29 - 38},
        title = {Versioned Distributed Arrays for Resilience in Scientific Applications: Global View
            Resilience},
        volume = {51},
        year = {2015}
      }
      
    2. Sayed, Salem, Stephan Graf, Michael Hennecke, Dirk Pleiter, Georg Schwarz, Heiko Schick, and Michael Stephan. 2013. “Using GPFS to Manage NVRAM-Based Storage Cache.” In Supercomputing: 28th International Supercomputing Conference, ISC 2013, Leipzig, Germany, June 16-20, 2013. Proceedings, edited by Julian Martin Kunkel, Thomas Ludwig, and Hans Werner Meuer, 435–46. Berlin, Heidelberg: Springer Berlin Heidelberg. doi:10.1007/978-3-642-38750-0_33.
      @inbook{SayedEtAl2013,
        address = {Berlin, Heidelberg},
        author = {Sayed, Salem and Graf, Stephan and Hennecke, Michael and Pleiter, Dirk and Schwarz, Georg and Schick, Heiko and Stephan, Michael},
        chapter = {Using GPFS to Manage NVRAM-Based Storage Cache},
        doi = {10.1007/978-3-642-38750-0_33},
        editor = {Kunkel, Julian Martin and Ludwig, Thomas and Meuer, Hans Werner},
        pages = {435--446},
        publisher = {Springer Berlin Heidelberg},
        title = {Supercomputing: 28th International Supercomputing Conference, ISC 2013, Leipzig, Germany,
            June 16-20, 2013. Proceedings},
        year = {2013}
      }