Bandwidth-aware Scheduling Repair Technique in Erasure-coded Clusters: Design and Analysis

Erasure codes offer a storage-efficient redundancy mechanism for maintaining data availability guarantees in storage clusters, but they incur high network traffic and long recovery times during failure repair. Extensive research has been carried out to reduce the recovery time. However, previous works either target specific erasure code constructions that are not commonly used in today's distributed storage clusters or neglect the heterogeneous bandwidth of real network environments. Since erasure-coded clusters are typically composed of multiple nodes with heterogeneous bandwidth that are accessed in parallel, the overall recovery time is mainly restricted by the low-bandwidth links. In this paper, we propose SMFRepair, a single-node multi-level forwarding repair technique based on Reed-Solomon codes for general fault tolerance, designed to improve repair performance in heterogeneous networks. SMFRepair carefully selects the helper nodes and uses idle nodes, which have sufficient unused network bandwidth, to bypass low-bandwidth links. It also pipelines the repair links that are optimized by idle nodes. Furthermore, a multi-node scheduling repair technique, called MSRepair, is proposed. MSRepair carefully schedules the multi-node repair links to saturate as much unoccupied bandwidth as possible and transfers data over the highest-bandwidth links available, with the primary objective of minimizing the recovery time. Large-scale simulations and real experiments on Amazon EC2 show that, compared to state-of-the-art repair techniques, SMFRepair can accelerate single-node recovery by up to 47.69%, and MSRepair can reduce the multi-node recovery time by 33.78% to 67.53%.
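
The bottleneck argument in the abstract can be illustrated with a toy model. The sketch below is not the SMFRepair algorithm from the paper; it is a minimal illustration under simplifying assumptions: every helper sends one chunk of equal size in parallel, each link has a fixed bandwidth, a relayed transfer is pipelined so its rate is the minimum of its two hops, and contention between transfers sharing a relay is ignored. The names `best_path`, `plan_repair`, and the bandwidth table are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical link table: (source, destination) -> bandwidth in MB/s.
Bandwidth = Dict[Tuple[str, str], float]


def best_path(bw: Bandwidth, src: str, dst: str,
              idle_nodes: List[str]) -> Tuple[float, List[str]]:
    """Return (effective bandwidth, path) from src to dst.

    The direct link is compared against forwarding through one idle node;
    a pipelined two-hop path is limited by its slower hop.
    """
    best = (bw[(src, dst)], [src, dst])
    for relay in idle_nodes:
        via = min(bw[(src, relay)], bw[(relay, dst)])
        if via > best[0]:
            best = (via, [src, relay, dst])
    return best


def plan_repair(bw: Bandwidth, candidates: List[str], requestor: str,
                idle_nodes: List[str], k: int,
                chunk_mb: float) -> Tuple[float, List[List[str]]]:
    """Pick the k helpers with the fastest paths and estimate repair time.

    With parallel transfers of equal-sized chunks, the repair finishes
    when the slowest chosen path finishes.
    """
    paths = sorted((best_path(bw, h, requestor, idle_nodes) for h in candidates),
                   key=lambda p: p[0], reverse=True)
    chosen = paths[:k]
    bottleneck = min(rate for rate, _ in chosen)
    return chunk_mb / bottleneck, [path for _, path in chosen]


if __name__ == "__main__":
    # Made-up bandwidths for three helpers (h1..h3), one idle node (i),
    # and the requestor (r) that rebuilds the lost chunk.
    bw = {
        ("h1", "r"): 100.0, ("h2", "r"): 10.0, ("h3", "r"): 80.0,
        ("h1", "i"): 100.0, ("h2", "i"): 90.0, ("h3", "i"): 100.0,
        ("i", "r"): 95.0,
    }
    print(plan_repair(bw, ["h1", "h2", "h3"], "r", [], k=3, chunk_mb=90.0))
    print(plan_repair(bw, ["h1", "h2", "h3"], "r", ["i"], k=3, chunk_mb=90.0))
```

In this made-up example the direct link from `h2` limits the repair to 10 MB/s, while forwarding through the idle node raises the bottleneck to 90 MB/s. A realistic scheduler would also have to account for several transfers sharing the same relay or the requestor's downlink, which this sketch deliberately leaves out.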

Keywords: recovery time; repair technique; erasure; repair

Journal Title: IEEE Transactions on Parallel and Distributed Systems
Year Published: 2022
