Toward taming large and complex data flows in data-centric supercomputing

Research topic and goals

Data-centric supercomputing is increasingly important for meeting national and international scientific missions and is becoming an integral part of many traditional computational science domains, including climate, cosmology, engineering, combustion, and astrophysics. These applications require the ability to rapidly and reliably compute, move, and manage large amounts of data through deep and complex interconnect and memory architectures with diverse sources and destinations, including scientific instruments, storage systems, supercomputers, and analysis systems.

Understanding, characterizing, and transforming application data flows over a range of architectures is a difficult and time-consuming task. The difficulty lies in abstracting the system architecture effectively and in relating its complex organization to the data movement requirements in both space and time, so that the information can be used to explore tradeoffs (e.g., power consumption versus execution time) and to identify transformations and mappings that may yield better performance (e.g., moving compute to the data or process-mapping heuristics). Moreover, because of budget constraints, these infrastructures are increasingly shared by diverse, concurrent applications, including those that require data-intensive flows. New approaches are required to take us to the next level in understanding the interactions between system infrastructure and application data flows at extreme scales.

We propose an integrated research program bringing together experts in network science and modeling, application modeling, high-performance runtime system software, and algorithm design to address these issues and improve understanding of the relationships between data-centric application flows and the architectural features of future systems. At the heart of this program is a framework for modeling and abstracting resource characteristics (e.g., topology and nonvolatile memory), abstracting application data flow behavior (including quality of service, I/O, and communications), and developing cross-layer optimal transformations for mapping flows effectively to the underlying resources.

Results for 2015/2016

We have worked on modeling the problem: communication cost, aggregator placement, I/O phases, etc. We have proposed placement strategies, but they need to be refined and tested. A preliminary implementation is being developed in HACC I/O, the I/O kernel of the HACC cosmology code.

Results for 2016/2017

We have taken the network topology into account when mapping aggregators, and we propose an optimized buffering system to reduce the aggregation cost. We have validated our approach using micro-benchmarks and the I/O kernel of a large-scale cosmology simulation. We have shown I/O operations running up to 15x faster than with a standard implementation of MPI I/O.
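The idea of topology-aware aggregator placement can be illustrated with a minimal sketch. It is not the project's actual cost model: it assumes a hypothetical 2D mesh where the hop count is the Manhattan distance, and a made-up cost equal to (hops x bytes) for gathering data on the aggregator plus (hops x total bytes) for sending it to the I/O node. The names `hops`, `placement_cost`, and `choose_aggregator` are illustrative only.

```python
from typing import Dict, Tuple

Coord = Tuple[int, int]


def hops(a: Coord, b: Coord) -> int:
    """Hop count between two nodes, assuming a 2D mesh (Manhattan distance)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def placement_cost(candidate: Coord, volumes: Dict[Coord, int], io_node: Coord) -> int:
    """Hypothetical cost of electing `candidate` as the aggregator.

    volumes maps each compute node to the number of bytes it contributes.
    """
    # Phase 1: every node sends its data to the candidate aggregator.
    gather = sum(hops(src, candidate) * size for src, size in volumes.items())
    # Phase 2: the aggregator forwards all gathered data to the I/O node.
    flush = hops(candidate, io_node) * sum(volumes.values())
    return gather + flush


def choose_aggregator(volumes: Dict[Coord, int], io_node: Coord) -> Coord:
    """Pick the compute node with the lowest assumed cost (ties: smallest coordinate)."""
    return min(sorted(volumes), key=lambda c: placement_cost(c, volumes, io_node))


if __name__ == "__main__":
    volumes = {(0, 0): 10, (2, 0): 10, (4, 0): 10}
    io_node = (4, 1)
    for node in sorted(volumes):
        print(node, placement_cost(node, volumes, io_node))
    print("aggregator:", choose_aggregator(volumes, io_node))
```

In this toy configuration the node closest to the I/O node wins over the geometric center of the group, because the cost of forwarding the full aggregated volume dominates. A real model would also have to account for link bandwidth, contention, and the actual machine topology.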

Results for 2017/2018

We have developed TAPIOCA, an MPI-based library implementing an efficient topology-aware two-phase I/O algorithm. TAPIOCA can take advantage of double buffering and one-sided communication to minimize idle time during data aggregation. We validated our approach at large scale on two leadership-class supercomputers: Mira (IBM BG/Q) and Theta (Cray XC40). On both architectures, we show a substantial improvement in I/O performance compared with the default MPI I/O implementation.
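Why double buffering reduces idle time can be seen with a small timing model. This is a sketch under simplifying assumptions, not TAPIOCA's implementation: every aggregation round takes a constant `fill` time to gather data into a buffer and a constant `flush` time to write it out, and the function names are hypothetical. With one buffer the two phases alternate; with two buffers the aggregator fills one while the other is being flushed.

```python
def single_buffer_time(fill: int, flush: int, rounds: int) -> int:
    """Total time when aggregation and write strictly alternate on one buffer."""
    return rounds * (fill + flush)


def double_buffer_time(fill: int, flush: int, rounds: int) -> int:
    """Total time when filling one buffer overlaps with flushing the other."""
    if rounds == 0:
        return 0
    fill_done = [0] * rounds
    flush_done = [0] * rounds
    for r in range(rounds):
        prev_fill = fill_done[r - 1] if r >= 1 else 0
        # Buffer r % 2 is reusable only once round r - 2 has been flushed.
        buffer_free = flush_done[r - 2] if r >= 2 else 0
        fill_done[r] = max(prev_fill, buffer_free) + fill
        # Flushes are serialized and need the corresponding fill to be complete.
        prev_flush = flush_done[r - 1] if r >= 1 else 0
        flush_done[r] = max(fill_done[r], prev_flush) + flush
    return flush_done[-1]


if __name__ == "__main__":
    print(single_buffer_time(fill=2, flush=3, rounds=4))  # 20
    print(double_buffer_time(fill=2, flush=3, rounds=4))  # 14
```

With these illustrative numbers the write phase is the bottleneck, so after the first fill the aggregation time is fully hidden behind the flushes. Measured behavior on a real system depends on the network, the file system, and the buffer sizes.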

Visits and meetings

  • Emmanuel Jeannot visited ANL in March 2015
  • François Tessier stayed 10 days at ANL in March 2015
  • Emmanuel Jeannot visited ANL in June 2016
  • François Tessier visited Inria in December 2016
  • Emmanuel Jeannot and Guillaume Aupy visited ANL in July 2017

Impact and publications

François Tessier moved from Inria to ANL in February 2016. Part of his work is focused on this project. Results have been published at the 1st Workshop on Optimization of Communication in HPC Runtime Systems (IEEE COM-HPC16), held in conjunction with Supercomputing 2016 (Tessier et al. 2016). We have published our work on TAPIOCA at Cluster 2017 (Tessier, Vishwanath, and Jeannot 2017).

  1. Tessier, François, Venkatram Vishwanath, and Emmanuel Jeannot. 2017. “TAPIOCA: An I/O Library For Optimized Topology-Aware Data Aggregation on Large-Scale Supercomputers.” In Cluster Computing (CLUSTER), 2017 IEEE International Conference On, 70–80. IEEE.
    @inproceedings{tvj17,
      title = {TAPIOCA: An I/O Library for Optimized Topology-Aware Data Aggregation on Large-Scale Supercomputers},
      author = {Tessier, Fran{\c{c}}ois and Vishwanath, Venkatram and Jeannot, Emmanuel},
      booktitle = {Cluster Computing (CLUSTER), 2017 IEEE International Conference on},
      pages = {70--80},
      year = {2017},
      organization = {IEEE}
    }
    
  2. Tessier, François, Preeti Malakar, Venkatram Vishwanath, Emmanuel Jeannot, and Florin Isaila. 2016. “Topology-Aware Data Aggregation For Intensive I/O on Large-Scale Supercomputers.” In 1st Workshop On Optimization of Communication in HPC Runtime Systems (IEEE COM-HPC16). Salt-Lake City, United States: IEEE. https://hal.inria.fr/hal-01394741.
    @inproceedings{tmv+16,
      title = {{Topology-Aware Data Aggregation for Intensive I/O on Large-Scale Supercomputers}},
      author = {Tessier, Fran{\c c}ois and Malakar, Preeti and Vishwanath, Venkatram and Jeannot, Emmanuel and Isaila, Florin},
      url = {https://hal.inria.fr/hal-01394741},
      booktitle = {{1st Workshop on Optimization of Communication in HPC runtime systems (IEEE COM-HPC16)}},
      address = {Salt-Lake City, United States},
      publisher = {{IEEE}},
      year = {2016},
      month = nov,
      pdf = {https://hal.inria.fr/hal-01394741/file/topoIO-paper.pdf},
      hal_id = {hal-01394741},
      hal_version = {v1}
    }
    

Future plans

The next step is to determine which parameters to consider when computing a near-optimal number of aggregators. Additionally, we plan to implement the topology-aware aggregator placement once a stable version of the data aggregation has been developed. We want to extend our approach to any kind of HPC system, especially the new Theta system. We also want to work on memory partitioning for workflows performing I/O requests. A visit between ANL and Inria is scheduled for spring 2018.

References