Automatic Optimization of Multi-Physics Coupled Executions through Closed-Loop Resource Arbitration

Research topic and goals

Many complex simulation problems are tackled by coupling independent domain-specific physics solvers, for example fluid-structure simulations or weather simulations that combine models at different scales. Enabling efficient multi-code executions facilitates code reuse and is a natural way to extend the simulation capabilities of a framework. However, efficiently exploiting modern computing architectures is already challenging for a single code, and combining different codes in a coupled execution makes it harder still. One of the problems is the distribution of resources among the physics codes that compose the model. An imperfect distribution leaves some resources idle and wastes computational capacity. For staggered coupling schemes, where the physics codes do not run concurrently, the computing resources should be transferred from one code to the other. For monolithic or implicit approaches, a balance must be found so that both code instances take the same time to reach the communication stage of the coupling. Several strategies can achieve these goals, most of them relying on the malleability of the coupled codes.
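To make the cost of an imperfect distribution concrete, the following minimal sketch uses a simplified model that is our assumption and not part of the project: two concurrently running codes with ideal strong scaling, so that the time to reach the coupling point is work divided by cores. All function names and work values are hypothetical.

```python
def time_to_sync(work, cores):
    """Time for one code to reach the coupling point (ideal scaling, an assumption)."""
    return work / cores


def idle_fraction(work_a, work_b, cores_a, cores_b):
    """Fraction of core-time wasted while the faster code waits for the slower one."""
    t_a = time_to_sync(work_a, cores_a)
    t_b = time_to_sync(work_b, cores_b)
    step = max(t_a, t_b)                      # the coupled step ends with the slower code
    used = t_a * cores_a + t_b * cores_b      # core-time spent doing useful work
    return 1.0 - used / (step * (cores_a + cores_b))


def best_split(work_a, work_b, total_cores):
    """Exhaustively pick the integer core split that minimizes the coupled step time."""
    def step_time(cores_a):
        return max(time_to_sync(work_a, cores_a),
                   time_to_sync(work_b, total_cores - cores_a))

    cores_a = min(range(1, total_cores), key=step_time)
    return cores_a, total_cores - cores_a
```

Under this model, an even split of 48 cores between codes with a 3:1 work ratio leaves about a third of the core-time idle, whereas a 36/12 split removes the idle time. Real codes do not scale ideally, which is why the split has to be learned from measurements rather than computed in advance.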

In this project we aim to investigate strategies to increase the efficiency of coupled executions on modern architectures. The investigation will be demonstrated on coupled executions with the multi-physics framework Alya, developed at the Computer Applications for Science and Engineering (CASE) department of the Barcelona Supercomputing Center (BSC). For the resource arbitration, we will use Argonne’s Node Resource Manager, an infrastructure that partitions resources among application components using control-theoretical methods. This infrastructure combines application monitoring (PMPI, performance counters) with multi-armed-bandit strategies to automatically learn the performance-resource sweet spot for a collection of application components. As such, it relies on accurate monitoring capabilities and on the right set of actuators (resource arbitration mechanisms), both of which we will explore in depth for this set of use cases.
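The idea of learning a resource split with a multi-armed bandit can be sketched as follows. This is a generic epsilon-greedy strategy chosen for illustration; it is not the Node Resource Manager's API and not necessarily the bandit variant it implements. Each arm is a candidate core split, and `measured_step_time` is a hypothetical stand-in for real monitoring that uses an ideal-scaling model with made-up work values.

```python
import random


class EpsilonGreedyArbiter:
    """Generic epsilon-greedy bandit; each arm is one candidate resource split."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = [0] * len(self.arms)
        self.mean_reward = [0.0] * len(self.arms)

    def select(self):
        # Try every arm once before exploiting the best estimate.
        for index, count in enumerate(self.counts):
            if count == 0:
                return index
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.arms))   # explore
        return max(range(len(self.arms)), key=lambda i: self.mean_reward[i])

    def update(self, arm, reward):
        # Incremental running mean of the observed rewards for this arm.
        self.counts[arm] += 1
        self.mean_reward[arm] += (reward - self.mean_reward[arm]) / self.counts[arm]


def measured_step_time(split, work=(300.0, 100.0)):
    """Hypothetical monitor: coupled step time under ideal scaling."""
    return max(w / cores for w, cores in zip(work, split))


def learn_split(arms, steps=200, epsilon=0.1, seed=0):
    """Run the bandit and return the split with the best mean reward."""
    arbiter = EpsilonGreedyArbiter(arms, epsilon, seed)
    for _ in range(steps):
        arm = arbiter.select()
        # Reward is the negative step time: shorter steps are better.
        arbiter.update(arm, -measured_step_time(arbiter.arms[arm]))
    best = max(range(len(arbiter.arms)), key=lambda i: arbiter.mean_reward[i])
    return arbiter.arms[best]
```

In a real deployment the reward would come from monitored quantities such as time spent in MPI or performance counters, and applying an arm would go through an actuator that actually moves cores between components.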

Future Plans

We are identifying suitable signals (MPI communications) that can be used to detect load imbalance and trigger load balancing.
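One way such a signal could be derived is sketched below, under the assumption that the time each rank spends inside MPI calls during a step is available (for instance through a PMPI wrapper). The metric is the common ratio of average to maximum useful compute time; the 0.9 threshold and the function names are hypothetical, and the signals eventually selected in the project may differ.

```python
def load_balance(useful_times):
    """Average over maximum useful compute time; 1.0 means perfectly balanced."""
    return sum(useful_times) / (len(useful_times) * max(useful_times))


def should_rebalance(elapsed, mpi_times, threshold=0.9):
    """Flag a rebalance when the load-balance ratio falls below the threshold.

    elapsed   -- wall-clock duration of the step, identical for all ranks
    mpi_times -- per-rank time spent inside MPI calls during that step
    """
    useful = [elapsed - t for t in mpi_times]   # time not spent in MPI
    return load_balance(useful) < threshold
```

A rank that spends most of a step waiting in MPI has little useful work, which pulls the ratio down and triggers the rebalance flag.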

We expect to have a working solution for subsets of the Alya workloads within a year.

Visits and meetings

The project members met during previous JLESC meetings and at Supercomputing 2019.

Impact and publications

    References

    1. Borrell, R., D. Dosimont, M. Garcia-Gasulla, G. Houzeaux, O. Lehmkuhl, V. Mehta, H. Owen, M. Vázquez, and G. Oyarzun. 2020. “Heterogeneous CPU/GPU Co-Execution of CFD Simulations on the POWER9 Architecture: Application to Airplane Aerodynamics.” Future Generation Computer Systems 107: 31–48.
      @article{BORRELL202031,
        title = {Heterogeneous CPU/GPU co-execution of CFD simulations on the POWER9 architecture: Application to airplane aerodynamics},
        journal = {Future Generation Computer Systems},
        volume = {107},
        pages = {31--48},
        year = {2020},
        author = {Borrell, R. and Dosimont, D. and Garcia-Gasulla, M. and Houzeaux, G. and Lehmkuhl, O. and Mehta, V. and Owen, H. and Vázquez, M. and Oyarzun, G.}
      }
      
    2. Cajas, J., Guillaume Houzeaux, Mariano Vázquez, Marta Garcia-Gasulla, E. Casoni, Hadrien Calmet, Antoni Artigues, et al. 2018. “Fluid-Structure Interaction Based on HPC Multicode Coupling.” SIAM Journal on Scientific Computing 40 (January): C677–C703.
      @article{cajas,
        author = {Cajas, J. and Houzeaux, Guillaume and Vázquez, Mariano and Garcia-Gasulla, Marta and Casoni, E. and Calmet, Hadrien and Artigues, Antoni and Borrell, R. and Lehmkuhl, Oriol and Maldonado, Daniel and Yáñez, D. and Pons, R. and Martorell, J.},
        year = {2018},
        month = jan,
        pages = {C677--C703},
        title = {Fluid-Structure Interaction Based on HPC Multicode Coupling},
        volume = {40},
        journal = {SIAM Journal on Scientific Computing}
      }
      
    3. Vázquez, M., G. Houzeaux, S. Koric, A. Artigues, J. Aguado-Sierra, R. Arís, D. Mira, et al. 2015. “Alya: Multiphysics Engineering Simulation Towards Exascale.” J. Comput. Sci.
      @article{VazquezEtAl2015,
        author = {V\'azquez, M. and Houzeaux, G. and Koric, S. and Artigues, A. and Aguado-Sierra, J. and Ar\'{\i}s, R. and Mira, D. and Calmet, H. and Cucchietti, F. and Owen, H. and Taha, A. and Burness, E.D. and Cela, J.M. and Valero, M.},
        journal = {J. Comput. Sci.},
        keywords = {Alya},
        title = {Alya: Multiphysics Engineering Simulation Towards Exascale},
        year = {2015}
      }