Recognizing the impact of cluster recovery operations on performance and mean time to recovery (MTTR) is essential to maintaining service and availability objectives. Testing the impact of recovery operations can reveal cause-and-effect relationships between recovery parameters and the responses of performance and MTTR during the recovery process. This study combines design of experiments with response surface methodology to identify main effects and factor-to-factor interaction effects on the responses efficiently. Two Ceph clusters using different storage device technologies, HDD and SSD respectively, were used to characterize the impact of recovery operations on performance and MTTR. The quadratic and linear effects for both Ceph clusters were determined and reported. With 28 tests, MTTR and performance models were developed for each Ceph cluster based on those quadratic and linear effects. These models predict performance and MTTR well when recovery parameters are adjusted. Using design of experiments and response surface methodology not only allows cause-and-effect analysis but also identifies potentially inefficient parameters that cause performance loss during recovery. This approach not only introduces a new method for studying cause and effect in MTTR but also indicates areas for improvement toward more efficient recovery operations.
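
As a rough illustration of the kind of model the abstract describes, the sketch below fits a second-order response surface (intercept, linear, two-factor interaction, and squared terms) by ordinary least squares. It is not the authors' analysis: the two coded factors, the 3-level grid, and the coefficient values are all hypothetical and chosen only so the fit can be checked against known values.

```python
import itertools
import numpy as np

def quadratic_design_matrix(X):
    """Expand coded factor settings into intercept, linear,
    two-factor interaction, and squared columns (in that order)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    cols = [np.ones(n)]                                   # intercept
    cols += [X[:, i] for i in range(k)]                   # linear effects
    cols += [X[:, i] * X[:, j]                            # interaction effects
             for i, j in itertools.combinations(range(k), 2)]
    cols += [X[:, i] ** 2 for i in range(k)]              # quadratic effects
    return np.column_stack(cols)

def fit_response_surface(X, y):
    """Least-squares estimate of the second-order model coefficients."""
    A = quadratic_design_matrix(X)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    """Predict the response at new coded factor settings."""
    return quadratic_design_matrix(X) @ coef

# Hypothetical example: two coded recovery parameters on a 3-level grid (9 runs).
levels = [-1.0, 0.0, 1.0]
X = np.array(list(itertools.product(levels, levels)))

# Made-up coefficients, used only to generate a noise-free synthetic response.
true_coef = np.array([100.0, -12.0, 5.0, 3.0, 8.0, 0.0])
y = quadratic_design_matrix(X) @ true_coef

coef = fit_response_surface(X, y)
```

In a real study the response vector would hold measured MTTR or throughput for each run, and the fitted coefficients would be screened for significance before the model is used for prediction.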
               