Abstract Motivation Spaced seeds are robust alternatives to k-mers in analyzing nucleotide sequences with high base mismatch rates. Hashing is also crucial for efficiently storing abundant sequence data. Here, we… Click to show full abstract
Abstract Motivation Spaced seeds are robust alternatives to k-mers in analyzing nucleotide sequences with high base mismatch rates. Hashing is also crucial for efficiently storing abundant sequence data. Here, we introduce ntHash2, a fast algorithm for spaced seed hashing that can be integrated into various bioinformatics tools for efficient sequence analysis with applications in genome research. Results ntHash2 is up to 2.1× faster at hashing various spaced seeds than the previous version and 3.8× faster than conventional hashing algorithms with naïve adaptation. Additionally, we reduced the collision rate of ntHash for longer k-mer lengths and improved the uniformity of the hash distribution by modifying the canonical hashing mechanism. Availability and implementation ntHash2 is freely available online at github.com/bcgsc/ntHash under an MIT license. Supplementary information Supplementary data are available at Bioinformatics online.
               
Click one of the above tabs to view related content.