Big data analytics has attracted close attention from both industry and academia because of its benefits in cost reduction and better decision making. With the fast growth of various global services, there is an increasing need for big data analytics across multiple data centers (DCs) located in different countries or regions. This calls for a cross-DC data processing platform optimized for the geo-distributed computing environment. Although some recent efforts have been made toward geo-distributed big data analytics, they cannot guarantee predictable job completion time, and they would incur excessive traffic over the inter-DC network, which is a scarce resource shared by many applications. In this paper, we study how to minimize the inter-DC traffic generated by MapReduce jobs that target geo-distributed big data, while providing predictable job completion time. To achieve this goal, we formulate an optimization problem that jointly considers input data movement and task placement. Furthermore, we guarantee predictable job completion time by applying the chance-constrained optimization technique, such that a MapReduce job finishes within a predefined job completion time with high probability. To evaluate the performance of our proposal, we conduct extensive simulations using real traces generated by a set of queries on Hive. The results show that our proposal can reduce inter-DC traffic by 55 percent compared with centralized processing, which aggregates all data to a single data center.
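
As a rough illustration of the chance-constrained idea in the abstract, a guarantee of this kind is commonly written in the generic form below. The symbols are assumptions chosen for illustration, not the paper's notation: $x$ stands for the joint data movement and task placement decisions, $C(x)$ for the resulting inter-DC traffic, $T(x)$ for the (random) job completion time, $D$ for the predefined completion time, and $\epsilon$ for the tolerated probability of missing it.

```latex
\begin{aligned}
\min_{x} \quad & C(x) \\
\text{s.t.} \quad & \Pr\bigl[\, T(x) \le D \,\bigr] \ge 1 - \epsilon
\end{aligned}
```

Under this reading, "with high probability" corresponds to choosing a small $\epsilon$. The actual formulation may include further constraints that are not stated in the abstract.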
               