Computing the LZ77 factorization is a fundamental task in text compression and indexing, being the size z of this compressed representation closely related to the self-repetitiveness of the text. A… Click to show full abstract
Computing the LZ77 factorization is a fundamental task in text compression and indexing, being the size z of this compressed representation closely related to the self-repetitiveness of the text. A long-standing problem is to compute LZ77 using small working space. Considering that $$\mathcal {O}(z)$$O(z) words of space can be significantly (up to exponentially) smaller than the size n of the input text, even succinct and entropy-compressed solutions are often unduly memory demanding. In this work we focus on an important measure of text repetitiveness: the number $$r$$r of equal-letter runs in the Burrows–Wheeler transform of the reversed input text. As z, the measure $$r$$r is closely related to the number of repetitions in the text and can be exponentially smaller than n. We describe two algorithms computing LZ77 in $$\mathcal {O}(r\log n)$$O(rlogn) bits of working space and $$\mathcal {O}(n\log r)$$O(nlogr) time. Roughly speaking, our algorithms store a constant number of memory words per BWT run to keep track of first-last run-positions and a suitable indexing mechanism to sample the runs of the BWT (instead of its positions). Important consequences of our results include (i) the possibility to convert from RLBWT- to LZ77-based compressed formats without first decompressing the text, and (ii) the existence of asymptotically-optimal construction algorithms for repetition-aware self-indexes based on these compression techniques. We finally describe an implementation of our solutions and present extensive experiments on highly repetitive datasets. Our algorithms use a working space as small as 1% of the dataset size and are two to three orders of magnitude more space-efficient (albeit slower) than existing solutions based, respectively, on entropy compression and suffix arrays.
               
Click one of the above tabs to view related content.