
A Hybrid MPI-OpenMP Strategy to Speedup the Compression of Big Next-Generation Sequencing Datasets



DNA sequencing has moved into the realm of Big Data due to the rapid development of high-throughput, low-cost Next-Generation Sequencing (NGS) technologies. Sequential data compression solutions that were once sufficient to efficiently store and distribute this information are now falling behind. In this paper we introduce phyNGSC, a hybrid MPI-OpenMP strategy to speed up the compression of big NGS data by combining the features of both distributed and shared memory architectures. Our algorithm balances the workload among processes and threads, alleviates memory latency by exploiting locality, and accelerates I/O by reducing excessive read/write operations and inter-node message exchange. To make the algorithm scalable, we introduce a novel timestamp-based file structure that allows us to write the compressed data in a distributed and non-deterministic fashion while retaining the capability of reconstructing the dataset in its original order. Our experimental results show that phyNGSC compressed big NGS datasets 45 to 98 percent faster than NGS-specific sequential compressors, with throughputs of up to 3 GB/s. Our theoretical analysis and experimental results suggest strong scalability, with some datasets yielding super-linear speedups and constant efficiency. We were able to compress 1 terabyte of data in under 8 minutes, compared to more than 5 hours taken by NGS-specific compression algorithms running sequentially. Compared to other parallel solutions, phyNGSC achieved up to 6x speedups while maintaining a higher compression ratio. The code for this implementation is available at https://github.com/pcdslab/PHYNGSC.
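
The idea of writing compressed blocks in whatever order they finish, yet still recovering the original order, can be illustrated with a small single-process sketch. This is not the phyNGSC file format, which the abstract does not specify. It assumes a hypothetical block header holding an order tag (standing in for the timestamp) and a payload length, uses zlib as a stand-in compressor, and simulates non-deterministic arrival with a seeded shuffle instead of MPI processes and OpenMP threads. The names HEADER, write_blocks, read_blocks, and roundtrip are made up for illustration.

```python
import io
import random
import struct
import zlib

# Hypothetical block header (an assumption, not the phyNGSC layout):
# order tag as uint64, compressed payload length as uint32.
HEADER = struct.Struct("<QI")


def write_blocks(chunks, out, seed=0):
    """Compress each chunk and append it to `out` in shuffled order."""
    order = list(range(len(chunks)))
    # The shuffle stands in for the unpredictable order in which
    # parallel workers would finish and write their blocks.
    random.Random(seed).shuffle(order)
    for tag in order:
        payload = zlib.compress(chunks[tag])
        out.write(HEADER.pack(tag, len(payload)))
        out.write(payload)


def read_blocks(inp):
    """Scan all blocks, then reassemble the data by ascending order tag."""
    found = {}
    while True:
        head = inp.read(HEADER.size)
        if not head:
            break  # end of stream
        tag, length = HEADER.unpack(head)
        found[tag] = zlib.decompress(inp.read(length))
    return b"".join(found[tag] for tag in sorted(found))


def roundtrip(data, chunk_size=64, seed=0):
    """Split, write out of order, read back, and return the restored bytes."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    buf = io.BytesIO()
    write_blocks(chunks, buf, seed)
    buf.seek(0)
    return read_blocks(buf)
```

Because every block carries its own tag, writers need no agreement on who writes next, and the reader restores the order with a single scan and a sort. A real distributed version would also need each writer to claim a non-overlapping region of the shared file, which this sketch does not attempt.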

Keywords: hybrid mpi; compression; openmp strategy; mpi openmp; next generation; generation sequencing

Journal Title: IEEE Transactions on Parallel and Distributed Systems
Year Published: 2017
