Use of the Folding profiler to assist on data distribution for heterogeneous memory systems

Research topic and goals

We are extending the research on data distribution and partitioning for heterogeneous memory systems started at Argonne (Peña and Balaji 2014). This approach relies on an emulator-based, data-oriented profiler, now named EVOP (Peña and Balaji 2014), whose profiling stage is time-consuming. We are therefore evaluating whether the “Folding” profiling tool from BSC (Servat et al. 2015) can be adapted and used for this purpose. Because Folding is based on hardware counters, we expect the profiling time to be greatly reduced. Profilers based on hardware counters are lossy, however, so we need to determine whether this solution provides sufficient resolution for the subsequent stage to generate a well-optimized data distribution.
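As a rough illustration of what the subsequent distribution stage consumes, the sketch below places profiled objects into a fast memory of limited capacity. It is not the method of the cited work: the greedy accesses-per-byte heuristic, the `MemObject` record, and the `place_objects` helper are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemObject:
    """Hypothetical per-object profile record (names are illustrative)."""
    name: str
    size_bytes: int   # must be > 0
    accesses: int     # e.g. profiled memory accesses attributed to the object


def place_objects(objects, fast_capacity_bytes):
    """Greedy sketch: objects with the most accesses per byte go to fast memory first."""
    ranked = sorted(objects, key=lambda o: o.accesses / o.size_bytes, reverse=True)
    fast, slow = [], []
    free = fast_capacity_bytes
    for obj in ranked:
        if obj.size_bytes <= free:
            fast.append(obj.name)
            free -= obj.size_bytes
        else:
            # Does not fit in the remaining fast memory: falls back to slow memory.
            slow.append(obj.name)
    return fast, slow


profile = [
    MemObject("A", 400, 8000),
    MemObject("B", 800, 4000),
    MemObject("C", 200, 100),
    MemObject("D", 300, 5400),
]
fast, slow = place_objects(profile, fast_capacity_bytes=1000)
print("fast:", fast, "slow:", slow)
```

With a lossy profile, the `accesses` values are estimates, so the ranking (and therefore the placement) can differ from the one an exact profile would produce.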

We also analyze the potential of runtime support for heterogeneous memory systems. Profiling can be used to find the optimal data distribution, but the result is tied to the profiled system configuration. Runtime support may help when highly accessed objects do not fit into the fastest available memory layer and must therefore be allocated in a slower memory region. In that case, efficient prefetching from the slow memory layer into a software-managed fast memory cache may be beneficial.

Results for 2015/2016

So far we have:

  • Modified the Extrae profiler to generate EVOP-like reports with PEBS data.
  • Developed a mechanism to compare the results.
  • Performed an early evaluation of profiling performance and of the quality of the resulting object distribution.

We have noticed that the distributions obtained from EVOP and Extrae data do not always match. This may be attributed to a combination of two factors: data loss and differing cache behavior (in EVOP the cache is simulated according to the cache properties queried from the system).
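The text does not describe how the two distributions are compared, so the following is only a hypothetical sketch of one possible metric: the Jaccard overlap between the sets of objects each profile would place in fast memory, together with the objects on which they disagree. The function name and the example object names are assumptions.

```python
def placement_agreement(fast_a, fast_b):
    """Compare two fast-memory placements given as iterables of object names.

    Returns (jaccard, only_in_a, only_in_b), where jaccard is 1.0 when both
    placements select exactly the same objects.
    """
    a, b = set(fast_a), set(fast_b)
    union = a | b
    jaccard = len(a & b) / len(union) if union else 1.0
    return jaccard, sorted(a - b), sorted(b - a)


# Hypothetical placements derived from the two profilers.
evop_fast = ["A", "D", "C"]
extrae_fast = ["A", "B", "C"]
score, only_evop, only_extrae = placement_agreement(evop_fast, extrae_fast)
print(score, only_evop, only_extrae)
```

Listing the disagreeing objects, rather than reporting only a score, makes it easier to check whether a mismatch involves heavily accessed objects or only marginal ones.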

We used the profiling data to identify objects where runtime support may be useful. To evaluate the benefit of prefetching, we used an emulation in which slow memory is emulated by the Xeon Phi device memory, which is mapped into the host address space. Our emulation shows that, when highly accessed objects do not fit into fast memory, prefetching can be very useful for increasing performance.
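The control flow of such prefetching can be sketched as double buffering: while one block is being processed from the fast buffer, the next block is already being copied from slow memory. The code below is only an illustration of that pattern in plain Python, not the emulation described above; `fetch_block` is a hypothetical stand-in for a copy from the slow memory region, and no real speedup should be expected from this toy version.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_block(slow_data, start, block):
    """Stand-in for copying one block from slow memory into a fast buffer."""
    return list(slow_data[start:start + block])


def sum_with_prefetch(slow_data, block):
    """Sum all elements, overlapping the fetch of block i+1 with work on block i."""
    total = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start fetching the first block before entering the loop.
        pending = pool.submit(fetch_block, slow_data, 0, block)
        for start in range(0, len(slow_data), block):
            current = pending.result()       # wait for the block fetched earlier
            nxt = start + block
            if nxt < len(slow_data):
                # Issue the prefetch of the next block before computing.
                pending = pool.submit(fetch_block, slow_data, nxt, block)
            total += sum(current)            # compute on the fast copy
    return total


print(sum_with_prefetch(list(range(10)), block=4))
```

In a real runtime the copy would be an asynchronous transfer between memory regions, and the block size would be bounded by the capacity of the software-managed fast cache.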

Results for 2016/2017

We are following separate paths until we are ready to combine our research. ANL has been working on programming models, whereas BSC has been working on the underlying tools.

Visits and meetings

Frequent teleconferences and e-mail exchanges. No visits are planned yet. Antonio J. Peña moved from Argonne to BSC.

Impact and publications

We are writing separate papers.

Future plans

Validate the sampling technique for data distribution by implementing and evaluating sampling on EVOP. Joint software release.


1. Servat, Harald, Germán Llort, Juan González, Judit Giménez, and Jesús Labarta. 2015. “Low-Overhead Detection of Memory Access Patterns and Their Time Evolution.” In Euro-Par 2015: Parallel Processing.
2. Peña, Antonio J., and Pavan Balaji. 2014. “Toward the Efficient Use of Multiple Explicitly Managed Memory Subsystems.” In IEEE Cluster.
3. Peña, Antonio J., and Pavan Balaji. 2014. “A Framework for Tracking Memory Accesses in Scientific Applications.” In 43rd International Conference on Parallel Processing Workshops (ICPP Workshops).