Reconfiguring Distributed Storage Systems on HPC infrastructures

Research topic and goals

As parallel file systems reach their limits in HPC, an alternative approach is emerging: the creation of data services tailored to the needs of the applications that use them (Dorier et al. 2018). To this end, a set of building blocks for data services (e.g. membership service, key-value store, …) has been developed as part of the Mochi project (https://www.mcs.anl.gov/research/projects/mochi/). Some applications (typically workflows) require a number of machines that varies over time, so data services co-deployed with such malleable applications need to be able to rescale efficiently. Preliminary work by Cheriere et al. showed that reconfiguration can be done quickly when the amount of data per node is balanced. Our work in this context investigates the rescaling of distributed storage systems in HPC environments.

Results for 2017/2018

We modelled the duration of the commission and decommission operations and obtained theoretical lower bounds for both. We then considered HDFS as a use case and showed that our model explains the measured commission and decommission times. The existing decommission mechanism of HDFS performs well when the network is the bottleneck, but could be accelerated by up to a factor of 3 when the storage is the limiting factor. We also showed that commission in HDFS can be largely improved. The results on theoretical decommission time were published at the IEEE BigData 2017 conference (Cheriere and Antoniu 2017). Results for the commission time were added later, and an extended paper has been submitted and is under review at Elsevier JPDC. These additional results are independently available as a research report (Cheriere, Dorier, and Antoniu 2018).
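
To make the idea of a lower bound on decommission time concrete, here is a minimal sketch of a deliberately simplified model. It is not the model from the cited papers: the function name, its parameters, and its assumptions (perfectly balanced data, all data held by the leaving nodes rewritten evenly on the remaining nodes, each remaining node limited by the slower of its network and storage write bandwidth) are illustrative only.

```python
def decommission_lower_bound(data_per_node, n_nodes, n_leaving,
                             net_bw, storage_write_bw):
    """Lower bound (in seconds) on decommission time, simplified model.

    Assumptions (hypothetical, not taken from the cited papers):
    - each node initially holds `data_per_node` bytes (balanced load);
    - all data stored on the leaving nodes must be rewritten on the
      remaining nodes, spread evenly among them;
    - each remaining node ingests data at the slower of its network
      bandwidth and its storage write bandwidth (bytes per second).
    """
    if not 0 < n_leaving < n_nodes:
        raise ValueError("n_leaving must be between 1 and n_nodes - 1")
    if net_bw <= 0 or storage_write_bw <= 0:
        raise ValueError("bandwidths must be positive")

    remaining = n_nodes - n_leaving
    data_to_move = data_per_node * n_leaving    # bytes leaving the cluster
    share_per_node = data_to_move / remaining   # bytes each remaining node receives
    bottleneck_bw = min(net_bw, storage_write_bw)
    return share_per_node / bottleneck_bw


# Example: 20 nodes with 100 GB each, 4 nodes leave,
# 1 GB/s network and 0.5 GB/s storage writes (storage is the bottleneck).
t = decommission_lower_bound(100 * 10**9, 20, 4, 10**9, 5 * 10**8)
print(f"lower bound: {t:.1f} s")
```

In this toy model the bottleneck is simply the smaller of the two bandwidths, which mirrors the distinction drawn above between network-bound and storage-bound decommissions without reproducing the published bounds.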

Results for 2018/2019

We introduced Pufferbench (Cheriere, Dorier, and Antoniu 2018), a benchmark for evaluating how fast one can scale a distributed storage system up and down on a given infrastructure and, thereby, how viably one can implement storage malleability on it. It can also serve to quickly prototype and evaluate mechanisms for malleability in existing distributed storage systems. We validated Pufferbench against theoretical lower bounds for commission and decommission: it achieves performance within 16% of them. We then used Pufferbench to evaluate these operations in practice in HDFS: commission in HDFS could be accelerated by as much as 14 times. Our results show that: (1) the lower bounds for commission and decommission times we previously established are sound and can be approached in practice; (2) HDFS could handle these operations much more efficiently; and, most importantly, (3) malleability in distributed storage systems is viable and should be further leveraged for Big Data applications.

We also studied, from a theoretical point of view, the opportunities provided by relaxing fault tolerance during decommission operations. Results of this work are available in a research report (Cheriere, Dorier, and Antoniu 2018) and have been submitted to IEEE/ACM CCGrid 2019.

Furthermore, we focused on understanding the requirements of distributed storage systems co-deployed with HPC applications, designed a rescaling mechanism able to meet these requirements, and implemented it.

Visits and meetings

Nathanaël Cheriere visited Argonne National Laboratory from September to December 2018 (2.5 months) and developed Pufferscale, a rescaling scheduler that keeps the load balanced across nodes while ensuring the speed and stability of rescaling operations.

Impact and publications

  1. Cheriere, Nathanaël, Matthieu Dorier, and Gabriel Antoniu. 2018. “A Lower Bound for the Commission Times in Replication-Based Distributed Storage Systems.” Research Report RR-9186. Inria Rennes - Bretagne Atlantique. https://hal.archives-ouvertes.fr/hal-01817638.
    @techreport{Cheriere2018LowerCommission,
      title = {{A Lower Bound for the Commission Times in Replication-Based Distributed Storage Systems}},
      author = {Cheriere, Nathana{\"e}l and Dorier, Matthieu and Antoniu, Gabriel},
      url = {https://hal.archives-ouvertes.fr/hal-01817638},
      type = {Research Report},
      number = {RR-9186},
      pages = {1-26},
      institution = {{Inria Rennes - Bretagne Atlantique}},
      year = {2018},
      month = jun,
      keywords = {Commission ; Elastic Storage ; Distributed File System ; Malleable File System ; Lower Bound},
      pdf = {https://hal.archives-ouvertes.fr/hal-01817638/file/RR-9186.pdf},
      hal_id = {hal-01817638},
      hal_version = {v2}
    }
    
  2. ———. 2018. “Pufferbench: Evaluating and Optimizing Malleability of Distributed Storage.” In PDSW-DISCS 2018: 3rd Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems, 1–10. Dallas, United States. https://hal.archives-ouvertes.fr/hal-01892713.
    @inproceedings{Cheriere2018Pufferbench,
      title = {{Pufferbench: Evaluating and Optimizing Malleability of Distributed Storage}},
      author = {Cheriere, Nathana{\"e}l and Dorier, Matthieu and Antoniu, Gabriel},
      url = {https://hal.archives-ouvertes.fr/hal-01892713},
      booktitle = {{PDSW-DISCS 2018: 3rd Joint International Workshop on Parallel Data Storage \& Data Intensive Scalable Computing Systems}},
      address = {Dallas, United States},
      pages = {1-10},
      year = {2018},
      month = nov,
      keywords = {Distributed Storage System Malleability ; Benchmark ; Pufferbench},
      pdf = {https://hal.archives-ouvertes.fr/hal-01892713/file/Paper.pdf},
      hal_id = {hal-01892713},
      hal_version = {v1}
    }
    
  3. ———. 2018. “Lower Bounds for the Duration of Decommission Operations with Relaxed Fault Tolerance in Replication-Based Distributed Storage Systems.” Research Report RR-9229. Inria Rennes - Bretagne Atlantique. https://hal.archives-ouvertes.fr/hal-01943964.
    @techreport{Cheriere2018LowerRelaxed,
      title = {{Lower Bounds for the Duration of Decommission Operations with Relaxed Fault Tolerance in Replication-based Distributed Storage Systems}},
      author = {Cheriere, Nathana{\"e}l and Dorier, Matthieu and Antoniu, Gabriel},
      url = {https://hal.archives-ouvertes.fr/hal-01943964},
      type = {Research Report},
      number = {RR-9229},
      pages = {1-28},
      institution = {{Inria Rennes - Bretagne Atlantique}},
      year = {2018},
      month = dec,
      keywords = {Distributed Storage Systems ; Malleable Storage ; Fault Tolerance ; Elastic Storage ; Syst{\`e}me de stockage distribu{\'e} ; Stockage {\'e}lastique ; Stockage mall{\'e}able ; D{\'e}commission ; Tol{\'e}rance aux pannes},
      pdf = {https://hal.archives-ouvertes.fr/hal-01943964/file/Report.pdf},
      hal_id = {hal-01943964},
      hal_version = {v2}
    }
    
  4. Cheriere, Nathanaël, and Gabriel Antoniu. 2017. “How Fast Can One Scale Down a Distributed File System?” In BigData. Boston, United States. doi:10.1109/BigData.2017.8257922.
    @inproceedings{Cheriere2017How,
      title = {{How Fast Can One Scale Down a Distributed File System?}},
      author = {Cheriere, Nathana{\"e}l and Antoniu, Gabriel},
      url = {https://hal.archives-ouvertes.fr/hal-01644928},
      booktitle = {{BigData}},
      address = {Boston, United States},
      year = {2017},
      month = dec,
      doi = {10.1109/BigData.2017.8257922},
      keywords = {Decommission ; Model ; Malleable File System ; Distributed File System ; Elastic Storage},
      pdf = {https://hal.archives-ouvertes.fr/hal-01644928/file/ModelingDecommision.pdf},
      hal_id = {hal-01644928},
      hal_version = {v1}
    }
    

Future plans

We intend to explore how transient storage systems can be leveraged to support elastic in situ analytics.

References

  1. Dorier, Matthieu, Philip Carns, Kevin Harms, Robert Latham, Robert Ross, Shane Snyder, Justin Wozniak, et al. 2018. “Methodology for the Rapid Development of Scalable HPC Data Services.” Workshop. In Proceedings of the PDSW-DISCS 2018 Workshop (SC18).
    @inproceedings{Dorier2018Methodology,
      title = {{Methodology for the Rapid Development of Scalable HPC Data Services}},
      author = {Dorier, Matthieu and Carns, Philip and Harms, Kevin and Latham, Robert and Ross, Robert and Snyder, Shane and Wozniak, Justin and Gutierrez, Samuel and Robey, Bob and Settlemyer, Brad and Shipman, Galen and Soumagne, Jerome and Kowalkowski, James and Paterno, Marc and Sehrish, Saba},
      booktitle = {{Proceedings of the PDSW-DISCS 2018 Workshop (SC18)}},
      year = {2018},
      type = {workshop},
      url = {},
      pdf = {}
    }