A Comprehensive Software Stack for Dynamic Resources Management.

Research topic and goals

Dynamic Resource Management (DRM) allows the resources assigned to a job to change during its execution. DRM has attracted considerable interest in recent years because it could benefit both providers and users of HPC systems, for example by improving energy efficiency and throughput.

However, DRM requires changes and interactions throughout the whole HPC software stack. This project aims to integrate different layers of that stack: the OAR resource and job management system (RJMS), the Dynamic Processes with PSets (DPP) approach, and the Dynamic Management of Resources (DMR) programming interface.

The OAR RJMS introduces the job envelope to support DRM. Each time resources are requested or freed, a new RJMS job is created and linked to the initial job envelope, replacing any previous job. The result is a sequence of jobs with different associated resources, which together form a virtual global dynamic job.
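
The job envelope idea can be illustrated with a small data-structure sketch. This is only a toy model under stated assumptions, not OAR's implementation or API: the names `Job`, `JobEnvelope`, and `reconfigure`, as well as the job IDs and node names, are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Job:
    """One RJMS job with a fixed set of nodes (hypothetical model)."""
    job_id: int
    nodes: frozenset


@dataclass
class JobEnvelope:
    """Hypothetical envelope linking successive jobs into one virtual dynamic job."""
    envelope_id: int
    history: list = field(default_factory=list)

    @property
    def current(self):
        # The most recently linked job is the active one; earlier jobs were replaced.
        return self.history[-1] if self.history else None

    def reconfigure(self, job_id, nodes):
        # A resource change creates a new job and links it to the envelope.
        job = Job(job_id, frozenset(nodes))
        self.history.append(job)
        return job


envelope = JobEnvelope(envelope_id=1)
envelope.reconfigure(100, {"n1", "n2"})              # initial allocation
envelope.reconfigure(101, {"n1", "n2", "n3", "n4"})  # resources requested
envelope.reconfigure(102, {"n1"})                    # resources freed

sizes = [len(job.nodes) for job in envelope.history]
print(sizes, envelope.current.job_id)
```

In this sketch the envelope keeps the full sequence of jobs, so the history of the virtual dynamic job stays available even though only the last job is active.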

Dynamic Processes with PSets (DPP) (Huber et al. 2024) is a set of design paradigms, derived from prior work, for generic dynamic resource support in parallel programming models. The DPP design paradigms are based on system-application co-design and aim for a flexible, programming-model-agnostic abstraction. As a proof of concept, the DPP paradigms have been realized in a prototype based on Open MPI, OpenPMIx, and PRRTE.

The Dynamic Management of Resources (DMR) framework (Iserte et al. 2021) is a high-level API that facilitates the adoption of dynamicity in HPC codes. In particular, DMR can abstract different MPI dynamic solutions behind a single syntax. For instance, in an iterative code, DMR provides a series of operations around the main loop that handle all the dynamic logic behind the scenes, making it transparent to the user.
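
The pattern of placing a reconfiguration point around the main loop of an iterative code can be sketched as follows. This is a minimal single-process simulation, assuming a hypothetical `schedule` dictionary that stands in for the resource manager's decisions; it does not use MPI or the actual DMR API, and the names `redistribute` and `run_malleable` are illustrative only.

```python
def redistribute(chunks, new_size):
    """Repartition the data into new_size near-equal contiguous chunks."""
    flat = [x for chunk in chunks for x in chunk]
    base, extra = divmod(len(flat), new_size)
    result, start = [], 0
    for rank in range(new_size):
        end = start + base + (1 if rank < extra else 0)
        result.append(flat[start:end])
        start = end
    return result


def run_malleable(items, initial_size, schedule, iterations):
    """Iterative loop with a reconfiguration point at the start of each iteration.

    schedule maps an iteration number to a new process count (an assumption
    standing in for a decision made by the resource manager).
    """
    chunks = redistribute([list(items)], initial_size)
    sizes = []
    for it in range(iterations):
        # Reconfiguration point: resize and redistribute data if requested.
        new_size = schedule.get(it)
        if new_size is not None and new_size != len(chunks):
            chunks = redistribute(chunks, new_size)
        # Compute step: each simulated process updates its own chunk.
        chunks = [[x + 1 for x in chunk] for chunk in chunks]
        sizes.append(len(chunks))
    return chunks, sizes


chunks, sizes = run_malleable(range(8), 2, {1: 4, 3: 1}, 4)
print(sizes)
print(chunks)
```

The compute step never mentions the process count; all resizing logic is confined to the reconfiguration point, which is the separation that a high-level layer such as DMR aims to offer.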

Once the DRM framework is available, several scientific applications and benchmarks will be adapted to the dynamic resources paradigm and evaluated in terms of coding usability and reconfiguration performance.

This project aims to:

  • Integrate DPP and DMR into OAR.
  • Provide a friendly programming layer for the DRM software stack.
  • Create a collection of dynamic applications with a common interface.
  • Evaluate the coding usability and performance of the new dynamic resource management approach against the current state of the art.

Impact and publications

None yet.

Future plans

None yet.

References

  1. Huber, Dominik, Martin Schreiber, Martin Schulz, Howard Pritchard, and Daniel Holmes. 2024. “Design Principles of Dynamic Resource Management for High-Performance Parallel Programming Models.” arXiv:2403.17107 [cs.DC].
    @misc{huber2024design,
      title = {Design Principles of Dynamic Resource Management for High-Performance Parallel Programming Models},
      author = {Huber, Dominik and Schreiber, Martin and Schulz, Martin and Pritchard, Howard and Holmes, Daniel},
      year = {2024},
      eprint = {2403.17107},
      archiveprefix = {arXiv},
      primaryclass = {cs.DC}
    }
    
  2. Iserte, Sergio, Rafael Mayo, Enrique S. Quintana-Ortí, and Antonio J. Peña. 2021. “DMRlib: Easy-Coding and Efficient Resource Management for Job Malleability.” IEEE Transactions on Computers 70 (9): 1443–57. https://doi.org/10.1109/TC.2020.3022933.
    @article{iserte_dmrlib_2021,
      title = {{DMRlib}: {Easy}-{Coding} and {Efficient} {Resource} {Management} for {Job} {Malleability}},
      volume = {70},
      copyright = {All rights reserved},
      issn = {1557-9956},
      shorttitle = {{DMRlib}},
      url = {https://ieeexplore.ieee.org/document/9190024},
      doi = {10.1109/TC.2020.3022933},
      number = {9},
      urldate = {2024-01-23},
      journal = {IEEE Transactions on Computers},
      author = {Iserte, Sergio and Mayo, Rafael and Quintana-Ortí, Enrique S. and Peña, Antonio J.},
      month = sep,
      year = {2021},
      pages = {1443--1457},
    }
    
    Process malleability has proved to have a highly positive impact on the resource utilization and global productivity in data centers compared with the conventional static resource allocation policy. However, the non-negligible additional development effort this solution imposes has constrained its adoption by the scientific programming community. In this work, we present DMRlib, a library designed to offer the global advantages of process malleability while providing a minimalist MPI-like syntax. The library includes a series of predefined communication patterns that greatly ease the development of malleable applications. In addition, we deploy several scenarios to demonstrate the positive impact of process malleability featuring different scalability patterns. Concretely, we study two job submission modes (rigid and moldable) in order to identify the best-case scenarios for malleability using metrics such as resource allocation rate, completed jobs per second, and energy consumption. The experiments prove that our elastic approach may improve global throughput by a factor higher than 3x compared to the traditional workloads of non-malleable jobs.