Predicting failures and acting proactively has the potential to improve availability: a correct prediction followed by a successful mitigation brings a reward in the form of reduced downtime. Conversely, each incorrect prediction may introduce additional downtime (a penalty). Therefore, depending on the quality of prediction and the system parameters, predictive fault-tolerance methods may either improve or degrade availability compared with reactive ones. We first derive taxonomies of fault-tolerance techniques and policies to differentiate between reactive policies and proactive policies, the latter further classified as systematic and predictive. To evaluate whether a predictive policy improves availability, we derive an analytical model for availability quantification. We use Markov chains to extend the steady-state availability equation to include precision and recall, penalty and reward, mitigation success probability, and a potential failure rate increase due to the prediction load. We also derive an A-measure to optimize failure prediction for increasing availability. We conclude that precision and recall have an impact on availability comparable to that of changing MTTF and MTTR. To validate the model, we also simulate and analyze the availability of a virtualized server with exponentially distributed failure and repair times.
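
For orientation, the reactive baseline that the abstract's model extends can be sketched in a few lines. The sketch below is not the paper's extended equation, which the abstract does not state. It only shows the classical two-state steady-state availability, A = MTTF / (MTTF + MTTR), a Monte Carlo check with exponentially distributed up and down times, and the standard definitions of precision and recall. All function names and the numeric values used in the usage note are illustrative assumptions.

```python
import random


def steady_state_availability(mttf, mttr):
    """Classical two-state (up/down) steady-state availability.

    This is the reactive baseline only; penalty, reward, and mitigation
    terms from the abstract are NOT modeled here.
    """
    return mttf / (mttf + mttr)


def simulate_availability(mttf, mttr, cycles, seed=0):
    """Estimate availability from alternating failure/repair cycles.

    Time to failure and time to repair are drawn from exponential
    distributions with means mttf and mttr (an assumption matching the
    abstract's simulation setup only in distribution type).
    """
    rng = random.Random(seed)
    uptime = 0.0
    downtime = 0.0
    for _ in range(cycles):
        uptime += rng.expovariate(1.0 / mttf)    # rate = 1 / mean
        downtime += rng.expovariate(1.0 / mttr)
    return uptime / (uptime + downtime)


def precision_recall(true_pos, false_pos, false_neg):
    """Standard precision and recall of a failure predictor."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall
```

With hypothetical values such as `mttf=100.0` and `mttr=10.0` (hours), `simulate_availability` should approach `steady_state_availability` as `cycles` grows, which is the kind of agreement between analytical model and simulation that the abstract describes for its extended model.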
               