
Lightweight hash-based de-duplication system using the self detection of most repeated patterns as chunks divisors

Abstract: Data reduction has gained growing emphasis due to the rapid and unsystematic increase in digital data and has become a sensible approach in big data systems. Data deduplication is a technique to optimize storage requirements and plays a vital role in eliminating redundancy in large-scale storage. Although content-defined chunking is robust in finding suitable chunk-level break-points for redundancy elimination, it faces three key problems: (1) low chunking performance, which makes the chunking stage a bottleneck, (2) large variation in chunk size, which reduces deduplication efficiency, and (3) hash computing overhead. To handle these challenges, this paper proposes a technique for finding proper cut-points among chunks using a set of commonly repeated patterns (CRP): it picks out the most frequent sequences of adjacent bytes (i.e., contiguous segments of bytes) as breakpoints. In addition, a scalable lightweight triple-leveled hashing function (LT-LH) is proposed to mitigate the cost of hash computation and storage overhead; three hash levels were used in the tests, and this number depends on the size of the data to be deduplicated. To evaluate the performance of the proposed technique, a set of tests was conducted to analyze the dataset characteristics and choose a near-optimal length for the byte sequences used as divisors to produce chunks. The performance assessment also includes determining the system parameter values that lead to an enhanced deduplication ratio and reduce the system resources needed for data deduplication. The results demonstrate that the CRP algorithm is 15 times faster than the basic sliding window (BSW) approach and about 10 times faster than two thresholds two divisors (TTTD). The proposed LT-LH is five times faster than Secure Hash Algorithm 1 (SHA-1) and Message-Digest Algorithm 5 (MD5), with better storage saving.
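To make the CRP idea concrete, the sketch below shows one plausible way to select the most frequent byte sequences in a dataset and use them as chunk divisors, followed by hash-based deduplication of the resulting chunks. It is only an illustration under assumed parameters: the function names (find_common_patterns, chunk_by_patterns, deduplicate), the pattern length, the minimum/maximum chunk sizes, and the use of SHA-1 as the chunk fingerprint are not taken from the paper, which instead proposes its own lightweight triple-leveled hash (LT-LH).

```python
import hashlib
from collections import Counter

def find_common_patterns(data: bytes, pattern_len: int = 4, top_k: int = 8) -> set:
    """Count every contiguous pattern_len-byte sequence and return the
    top_k most frequent ones to act as chunk divisors (CRP-style)."""
    counts = Counter(data[i:i + pattern_len]
                     for i in range(len(data) - pattern_len + 1))
    return {pattern for pattern, _ in counts.most_common(top_k)}

def chunk_by_patterns(data: bytes, divisors: set, pattern_len: int = 4,
                      min_chunk: int = 2048, max_chunk: int = 65536) -> list:
    """Cut a chunk whenever a divisor pattern is seen, while enforcing
    minimum and maximum chunk sizes (assumed limits, not the paper's)."""
    chunks, start, i = [], 0, 0
    while i <= len(data) - pattern_len:
        size = i - start + pattern_len
        if (size >= min_chunk and data[i:i + pattern_len] in divisors) or size >= max_chunk:
            chunks.append(data[start:i + pattern_len])
            start = i + pattern_len
            i = start
        else:
            i += 1
    if start < len(data):
        chunks.append(data[start:])          # trailing bytes form the last chunk
    return chunks

def deduplicate(chunks: list) -> dict:
    """Store one copy per distinct chunk, keyed by its hash.
    SHA-1 is a placeholder here; the paper replaces this with LT-LH."""
    store = {}
    for chunk in chunks:
        store.setdefault(hashlib.sha1(chunk).hexdigest(), chunk)
    return store

if __name__ == "__main__":
    # Hypothetical usage: read a file, derive divisors, chunk, and deduplicate.
    data = open("sample.bin", "rb").read()
    divisors = find_common_patterns(data)
    unique = deduplicate(chunk_by_patterns(data, divisors))
    ratio = len(data) / max(1, sum(len(c) for c in unique.values()))
    print(f"chunks kept: {len(unique)}, deduplication ratio: {ratio:.2f}")
```

In this sketch the divisor patterns are learned from the data itself (the "self detection" in the title), so frequently recurring byte sequences become natural breakpoints; the paper additionally tunes the divisor length and system parameters from dataset analysis rather than using fixed defaults.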

Keywords: hash; system; repeated patterns; lightweight hash; storage; deduplication

Journal Title: Journal of King Saud University - Computer and Information Sciences
Year Published: 2021
