Dependency-aware jobs, such as big data analytic workflows, are commonly executed on the cloud. They are compiled into directed acyclic graphs whose tasks are linked according to their dependencies. The cloud scheduler, which maintains a large pool of resources, is responsible for executing tasks in parallel. To resolve the complex dependencies, Deep Reinforcement Learning (DRL)-based schedulers are widely applied. However, we find that DRL-based schedulers are vulnerable to perturbations in the input jobs and may generate falsified decisions that benefit a particular job while delaying the others. By a perturbation, we mean a slight adjustment to a job's node features or dependencies that does not change its functionality. In this paper, we first explore the vulnerability of DRL-based schedulers to job perturbations without accessing any information about the DRL models used in the scheduler. We devise a black-box perturbation system in which a proxy model is trained to mimic the DRL-based scheduling policy, and we show that a high-fidelity proxy model can help craft effective perturbations. DRL-based schedulers can be adversely affected by the perturbations in up to 60% of cases. We then investigate how to improve the robustness of DRL-based schedulers to such perturbations. We propose an adversarial training framework that forces the neural model to adapt to the perturbation patterns during training, so as to eliminate the potential damage at deployment time. Experiments show that the adversarially trained scheduler is more robust: the chance of being affected is reduced by a factor of three and the potential damage is halved.
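To make the black-box attack idea concrete, the sketch below shows one plausible way such a proxy-based perturbation could work: a small network is trained by behaviour cloning on observed (job features, chosen task) pairs, and its gradients are then used to nudge node features toward favouring a target task. This is a minimal illustration, not the paper's implementation; all names (`ProxyPolicy`, `fit_proxy`, `craft_perturbation`, `eps`) and the MLP architecture are assumptions made for the example.

```python
# Hedged sketch (not the paper's code): a behaviour-cloning proxy that imitates
# an unknown DRL scheduling policy from observed (job-features, chosen-task)
# pairs, then uses the proxy's gradients to craft small node-feature changes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProxyPolicy(nn.Module):
    """Small MLP standing in for a graph-based task-scoring network (assumed)."""
    def __init__(self, feat_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, node_feats: torch.Tensor) -> torch.Tensor:
        # node_feats: (num_tasks, feat_dim) -> one scheduling score per task
        return self.net(node_feats).squeeze(-1)

def fit_proxy(proxy, traces, epochs=10, lr=1e-3):
    """Behaviour cloning: match the scheduler's observed task choices."""
    opt = torch.optim.Adam(proxy.parameters(), lr=lr)
    for _ in range(epochs):
        for node_feats, chosen_task in traces:       # chosen_task: int index
            logits = proxy(node_feats)
            loss = F.cross_entropy(logits.unsqueeze(0),
                                   torch.tensor([chosen_task]))
            opt.zero_grad(); loss.backward(); opt.step()

def craft_perturbation(proxy, node_feats, target_task, eps=0.05):
    """One FGSM-style step that slightly shifts node features so the proxy
    (and, if the proxy is faithful, the real scheduler) favours target_task."""
    x = node_feats.clone().requires_grad_(True)
    loss = F.cross_entropy(proxy(x).unsqueeze(0), torch.tensor([target_task]))
    loss.backward()
    return (x - eps * x.grad.sign()).detach()         # small, bounded change
```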
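The adversarial-training idea in the abstract can likewise be sketched as mixing clean and perturbed job graphs into the scheduler's training loss. Again this is only an illustrative skeleton under assumed interfaces: `perturb_job`, `scheduler_loss`, the `jobs` dataset, and `adv_weight` are placeholders, not the paper's actual framework.

```python
# Hedged sketch of adversarial training for a DRL-style scheduler: the policy
# is optimised on both the original and a perturbed version of each job, so it
# learns to produce consistent decisions under the perturbation patterns.
import torch

def adversarial_train(policy, jobs, perturb_job, scheduler_loss,
                      epochs=5, lr=1e-4, adv_weight=0.5):
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for job in jobs:
            adv_job = perturb_job(policy, job)        # e.g. via a crafted perturbation
            loss = (scheduler_loss(policy, job)
                    + adv_weight * scheduler_loss(policy, adv_job))
            opt.zero_grad(); loss.backward(); opt.step()
```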