DACE: a scalable DP‐means algorithm for clustering extremely large sequence data

Motivation: Advances in next‐generation sequencing technology have made it possible to produce large numbers of reads quickly and at low cost. In metagenomics, the 16S and 18S rRNA genes are widely used as marker genes to profile the diversity of microorganisms in environmental samples. Clustering the sequencing reads yields both the number of OTUs and their relative abundances. In many applications, downstream analysis depends on clustering very large sequencing datasets efficiently and accurately.

Results: Here, we report a scalable Dirichlet Process Means (DP‐means) algorithm, termed DACE, for clustering extremely large sequencing data. With an efficient random projection partition strategy for parallel clustering, DACE can cluster billions of sequences within a couple of hours. Experimental results show that DACE runs between 6 and 80 times faster than state‐of‐the‐art programs while maintaining better overall clustering accuracy. Using 80 cores, DACE clustered the Lake Taihu 16S rRNA gene sequencing data (~316M reads, 30 GB) in 25 min, and the Ocean TARA Eukaryotic 18S rRNA gene sequencing data (~500M reads, 88 GB) into ~100 000 clusters within an hour. When applied to the IGC gene catalogs of the human gut microbiome (~10M genes), DACE produced 9.8M clusters with 52K redundant genes in 1.5 hours of running time.

Availability and Implementation: DACE is available at https://github.com/tinglab/DACE.

Contacts: [email protected] or [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.
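To make the DP‐means idea named in the abstract concrete, the following is a minimal sketch of the generic DP‐means procedure on small numeric vectors. It is not DACE's implementation: the function name `dp_means`, the penalty parameter `lam`, and the use of squared Euclidean distance are assumptions chosen for illustration, and the sketch does not include the random projection partitioning or any parallelism. The key point it shows is that the number of clusters is not fixed in advance; a point opens a new cluster whenever it is farther than `lam` from every existing center.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance (an assumed stand-in for a sequence distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points: List[Point]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def dp_means(points: List[Point], lam: float,
             max_iter: int = 100) -> Tuple[List[List[float]], List[int]]:
    """Generic DP-means: like k-means, but a point whose squared distance
    to every center exceeds `lam` starts a new cluster."""
    centers = [mean(points)]          # start with one cluster at the global mean
    labels = [0] * len(points)
    for _ in range(max_iter):
        changed = False
        # Assignment step, possibly creating new clusters.
        for i, p in enumerate(points):
            dists = [sq_dist(p, c) for c in centers]
            k = min(range(len(centers)), key=dists.__getitem__)
            if dists[k] > lam:
                centers.append(list(p))
                k = len(centers) - 1
            if labels[i] != k:
                labels[i] = k
                changed = True
        # Update step: recompute centers and drop clusters that became empty.
        members: List[List[Point]] = [[] for _ in centers]
        for p, k in zip(points, labels):
            members[k].append(p)
        kept = [k for k, m in enumerate(members) if m]
        remap = {old: new for new, old in enumerate(kept)}
        centers = [mean(members[k]) for k in kept]
        labels = [remap[k] for k in labels]
        if not changed:
            break
    return centers, labels


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centers, labels = dp_means(data, lam=9.0)
    print(len(centers), labels)
```

A smaller `lam` yields more, tighter clusters, and a larger `lam` yields fewer; in an OTU setting this penalty would play the role of a similarity threshold, although how DACE sets it is not stated in the abstract.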

Keywords: dace; algorithm clustering; sequencing data; means algorithm; clustering extremely; extremely large

Journal Title: Bioinformatics
Year Published: 2017
