LAUSR.org creates dashboard-style pages of related content for over 1.5 million academic articles. Sign Up to like articles & get recommendations!

A Scalable Runtime Fault Localization Framework for High-Performance Computing Systems

Photo by mariusoprea from unsplash

Fault localization has become an increasingly challenging issue in high-performance computing (HPC) systems. Various techniques have been used for HPC systems. However, as the HPC systems scale out, resulting in… Click to show full abstract

Fault localization has become an increasingly challenging issue in high-performance computing (HPC) systems. Various techniques have been used for HPC systems. However, as the HPC systems scale out, resulting in the rapid deterioration of the existing techniques. In this context, we propose a message-passing based fault localization framework, namely MPFL, which provides a light-weight distributed service using tree-based fault detection (TFD) and fault analysis (TFA) algorithms. In essence, MPFL serves as a fault localization engine within message-passing libraries by enabling several system middleware such as job scheduler to provide abnormal information. We present details of the MPFL framework, including the implementation of TFD and TFA. Further, we develop the fault localization engine prototype within MVAPICH2. The experimental evaluation is performed on a typical HPC cluster with 10 computing nodes, which demonstrate the capability of MPFL and show that the MPFL service does not affect the performance of an application in practice.

Keywords: framework; fault; fault localization; high performance

Journal Title: International Journal of Parallel Programming
Year Published: 2017

Link to full text (if available)


Share on Social Media:                               Sign Up to like & get
recommendations!

Related content

More Information              News              Social Media              Video              Recommended



                Click one of the above tabs to view related content.