As the scientific community prepares to deploy an increasingly complex and diverse set of applications on exascale platforms, the need to assess reproducibility of simulations and identify the root causes… Click to show full abstract
As the scientific community prepares to deploy an increasingly complex and diverse set of applications on exascale platforms, the need to assess reproducibility of simulations and identify the root causes of reproducibility failures increases correspondingly. One of the greatest challenges facing reproducibility issues at exascale is the inherent non-determinism at the level of inter-process communication. The use of non-deterministic communication constructs is necessary to boost performance, but communication non-determinism can also hamper software correctness and result reproducibility. To address this challenge, we propose a software framework for identifying the percentage and sources of communication non-determinism. We model parallel executions as directed graphs and leverage graph kernels to characterize run-to-run variations in inter-process communication. We demonstrate the effectiveness of graph kernel similarity as a proxy for non-determinism, by showing that these kernels can quantify the type and degree of non-determinism present in communication patterns. To demonstrate our frameworkâs ability to link and quantify runtime non-determinism to root sources, demonstrate with present for an adaptive mesh refinement application, where our framework automatically quantifies the impact of function calls on non-determinism, and a Monte Carlo application, where our framework automatically quantifies the impact of parameter configurations on non-determinism.
               
Click one of the above tabs to view related content.