Scientific computing applications are being increasingly deployed on cloud computing platforms. Transient servers such as EC2 spot instances and Google Preemptible VMs, can be used to lower the costs of… Click to show full abstract
Scientific computing applications are being increasingly deployed on cloud computing platforms. Transient servers such as EC2 spot instances and Google Preemptible VMs, can be used to lower the costs of running applications on the cloud by up to $10\times$. However, the frequent preemptions and resource heterogeneity of these transient servers introduces many challenges in their effective and efficient use. In this paper, we develop techniques for modeling and mitigating preemptions of transient servers, and present SciSpot, a software framework that enables low-cost scientific computing on the cloud. SciSpot deploys applications on Google Cloud Preemptible Virtual Machines that exhibit temporally constrained preemptions: VMs are always preempted in a 24 hour interval. Our empirical analysis shows that the preemption rate is bathtub shaped, which raises multiple fundamental challenges in performance modeling and policy design. We develop a new reliability model for temporally constrained preemptions, and use statistical mechanics to show why the bathtub shape is a fundamental characteristic. SciSpot's design is guided by our observation that many emerging scientific computing applications that integrate machine learning with simulations, can be deployed as ‘`bags’' of jobs, which represent multiple instantiations of the same computation with different physical model parameters. For a bag of jobs, SciSpot finds the optimal transient server on-the-fly, by taking into account the price, performance, and preemption rates of different servers. SciSpot reduces costs by $5\times$ compared to conventional cloud deployments, and reduces makespans by up to $10\times$ compared to conventional high performance computing clusters.
               
Click one of the above tabs to view related content.