Network Simulations and Topology-aware Communications

Research topic and goals

High-radix direct network topologies such as Dragonfly have been proposed for petascale and exascale supercomputers because they provide fast interconnections and reduce network cost compared with traditional topologies. The design of new machines with a Dragonfly network, such as Theta, presents an opportunity to further improve the performance of distributed applications by making their algorithms aware of the topology. Current algorithms do not take the topology into account and thus miss many of the optimization opportunities it creates. This project aims to explore ways to exploit the strengths of the Dragonfly topology in order to propose and evaluate optimized algorithms for global communication operations such as AllGather and Scatter.

Results for 2016/2017

We studied and extended existing algorithms for collective communication operations and used CODES, an event-driven simulator, to evaluate them. The simulations show expected results for AllGather, as well as surprising ones for Scatter: the naive algorithm performs outstandingly well on Dragonfly because it exploits the characteristics of the routers in the network. In particular, the Scatter operation could be accelerated by a factor of up to 2 using a hardware-aware algorithm.
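
To make the Scatter trade-off concrete, the following is a minimal sketch that is not taken from the project's code. It compares the message schedule of a naive scatter (the root sends one chunk directly to each process) with a textbook binomial-tree scatter. The function names and the "chunk" cost unit are assumptions made for illustration, and the sketch models neither Dragonfly routing nor CODES. It only shows that the naive schedule moves less data in total, while the tree uses fewer rounds; which one wins in practice depends on how many messages the root's router can inject concurrently.

```python
def naive_scatter_schedule(num_procs, root=0):
    """Naive scatter: the root sends one chunk directly to every other rank.

    Returns a flat list of (src, dest, chunks) messages.
    """
    return [(root, dest, 1) for dest in range(num_procs) if dest != root]


def binomial_scatter_schedule(num_procs):
    """Textbook binomial-tree scatter rooted at rank 0.

    num_procs must be a power of two (a simplifying assumption).
    Returns a list of rounds; each round is a list of (src, dest, chunks).
    """
    if num_procs < 1 or num_procs & (num_procs - 1):
        raise ValueError("num_procs must be a power of two")
    rounds = []
    span = num_procs  # each current holder owns the chunks of `span` ranks
    while span > 1:
        half = span // 2
        # Every holder forwards the upper half of its chunks to a partner.
        rounds.append(
            [(src, src + half, half) for src in range(0, num_procs, span)]
        )
        span = half
    return rounds


def total_chunks(messages):
    """Total data volume of a list of messages, in chunks."""
    return sum(chunks for _, _, chunks in messages)


if __name__ == "__main__":
    p = 8
    naive = naive_scatter_schedule(p)
    tree = binomial_scatter_schedule(p)
    tree_messages = [msg for rnd in tree for msg in rnd]
    print("naive: messages =", len(naive), "chunks =", total_chunks(naive))
    print("tree:  rounds =", len(tree), "chunks =", total_chunks(tree_messages))
```

For 8 processes, the naive schedule sends 7 messages carrying 7 chunks in total, all from the root, whereas the binomial tree finishes in 3 rounds but moves 12 chunks because data is forwarded through intermediate ranks.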

These results were accepted as a poster for the ACM Student Research Competition (SRC) at SC 2016 (Cheriere and Dorier 2016), and Nathanael Cheriere won second prize in the ACM SRC.

Results for 2017/2018

We developed a Swift/T-based workflow to automate a large number of experiments using CODES, in order to speed up design-space exploration. This workflow helped us repeat our experiments with network models matching Argonne’s Theta machine. We submitted a paper to CCGrid 2018 presenting the results of these experiments.

Visits and meetings

Internship of Nathanael Cheriere at Argonne National Lab from January 2016 to June 2016, under the supervision of Matthieu Dorier and Rob Ross.

Impact and publications

  1. Cheriere, Nathanael, and Matthieu Dorier. 2016. “Design and Evaluation of Topology-Aware Scatter and AllGather Algorithms for Dragonfly Networks.” In IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC) - ACM Student Research Competition. http://sc16.supercomputing.org/sc-archive/src_poster/src_poster_pages/spost146.html.
    @inproceedings{CheriereEtAl2016,
      title = {{Design and Evaluation of Topology-aware Scatter and AllGather Algorithms for Dragonfly Networks}},
      author = {Cheriere, Nathanael and Dorier, Matthieu},
      booktitle = {{IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC) - ACM Student Research Competition}},
      year = {2016},
      url = {http://sc16.supercomputing.org/sc-archive/src_poster/src_poster_pages/spost146.html},
      pdf = {http://sc16.supercomputing.org/sc-archive/src_poster/poster_files/spost146s2-file2.pdf}
    }
    

Future plans

This project is finished.

References