Hadoop has become a popular framework for processing data-intensive applications in cloud environments. A core constituent of Hadoop is the scheduler, which is responsible for scheduling and monitoring the jobs… Click to show full abstract
Hadoop has become a popular framework for processing data-intensive applications in cloud environments. A core constituent of Hadoop is the scheduler, which is responsible for scheduling and monitoring the jobs and tasks, and rescheduling them in case of failures. Although fault-tolerance mechanisms have been proposed for Hadoop, the performance of Hadoop can be significantly impacted by unforeseen events in the cloud environment. In this paper, we introduce a dynamic and failure-aware framework that can be integrated within Hadoop scheduler and adjust the scheduling decisions based on collected information about the cloud environment. Our framework relies on predictions made by machine learning algorithms and scheduling policies generated by a Markovian Decision Process (MDP), to adjust its scheduling decisions on the fly. Instead of the fixed heartbeat-based failure detection commonly used in Hadoop to track active TaskTrackers (i.e., nodes that process the scheduled tasks), our proposed framework implements an adaptive algorithm that can dynamically detect the failures of the TaskTracker. To deploy our proposed framework, we have built, ATLAS+, an AdapTive Failure-Aware Scheduler for Hadoop. To assess the performance of ATLAS+, we conduct a large empirical study on a 100-node Hadoop cluster deployed on Amazon Elastic MapReduce (EMR), comparing the performance of ATLAS+ with those of three Hadoop schedulers (FIFO, Fair, and Capacity). Results show that ATLAS+ outperforms FIFO, Fair, and Capacity schedulers. ATLAS+ can reduce the number of failed jobs by up to 43 percent and the number of failed tasks by up to 59 percent. On average, ATLAS+ could reduce the total execution time of jobs by 10 minutes, which represents 40 percent of the job execution times, and by up to 3 minutes for tasks, which represents 47 percent of the task execution time. ATLAS+ also reduced CPU and memory usage by 22 and 20 percent, respectively.
               
Click one of the above tabs to view related content.