
SeqArray—a storage‐efficient high‐performance data format for WGS variant calls


Motivation: Whole‐genome sequencing (WGS) data are being generated at an unprecedented rate. Analysis of WGS data requires a flexible data format to store the different types of DNA variation. Variant call format (VCF) is a general text‐based format developed to store variant genotypes and their annotations. However, VCF files are large and data retrieval is relatively slow. Here we introduce a new WGS variant data format implemented in the R/Bioconductor package ‘SeqArray’ for storing variant calls in an array‐oriented manner, which provides the same capabilities as VCF but with multiple high compression options and data access using high‐performance parallel computing.

Results: Benchmarks using 1000 Genomes Phase 3 data show file sizes of 14.0 Gb (VCF), 12.3 Gb (BCF, binary VCF), 3.5 Gb (BGT) and 2.6 Gb (SeqArray), respectively. Reading genotypes in the SeqArray package is two to three times faster than with the htslib C library using BCF files. For the allele frequency calculation, the implementation in the SeqArray package is over 5 times faster than PLINK v1.9 with VCF and BCF files, and over 16 times faster than vcftools. When used in conjunction with R/Bioconductor packages, the SeqArray package provides users with a flexible, feature‐rich, high‐performance programming environment for analysis of WGS variant data.

Availability and Implementation: http://www.bioconductor.org/packages/SeqArray

Contact: [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.
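
To make the idea of array‐oriented genotype storage and the allele frequency calculation more concrete, here is a minimal Python sketch. It is not SeqArray's implementation or API (SeqArray is an R/Bioconductor package); the function name `allele_frequencies`, the flat variant‐major layout, and the `MISSING` sentinel are all assumptions made for illustration only.

```python
from array import array

MISSING = 255  # hypothetical sentinel marking a missing allele call


def allele_frequencies(genotypes, n_variants, n_samples, ploidy=2):
    """Return the reference-allele frequency for each variant.

    `genotypes` is a flat, variant-major array of allele indices
    (0 = reference, 1 and above = alternate, MISSING = no call).
    This layout is an assumption for the sketch, not the SeqArray format.
    """
    per_variant = n_samples * ploidy
    freqs = []
    for v in range(n_variants):
        # All calls for one variant sit in one contiguous slice.
        block = genotypes[v * per_variant:(v + 1) * per_variant]
        called = [a for a in block if a != MISSING]
        if not called:
            freqs.append(None)  # no usable calls for this variant
        else:
            freqs.append(called.count(0) / len(called))
    return freqs


# Two variants, three diploid samples, one sample missing at variant 2.
geno = array("B", [0, 0, 0, 1, 1, 1,
                   0, 1, MISSING, MISSING, 0, 0])
ref_af = allele_frequencies(geno, n_variants=2, n_samples=3)
print(ref_af)  # [0.5, 0.75]
```

Because each variant's calls are contiguous in this layout, a per‐variant statistic only needs a single slice of the array, which is one plausible reason array‐oriented storage suits this kind of calculation.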

Keywords: wgs variant; variant calls; variant; data format; high performance

Journal Title: Bioinformatics
Year Published: 2017


