LAUSR.org creates dashboard-style pages of related content for over 1.5 million academic articles. Sign Up to like articles & get recommendations!

Generative haplotype prediction outperforms statistical methods for small variant detection in next-generation sequencing data

Detection of germline variants in next-generation sequencing data is an essential component of modern genomics analysis. Variant detection tools typically rely on statistical algorithms such as de Bruijn graphs or… Click to show full abstract

Detection of germline variants in next-generation sequencing data is an essential component of modern genomics analysis. Variant detection tools typically rely on statistical algorithms such as de Bruijn graphs or Hidden Markov Models, and are often coupled with heuristic techniques and thresholds to maximize accuracy. Here, we introduce a new approach that replaces these handcrafted statistical techniques with a single deep generative model. The model’s input is the set of reads aligning to a single genomic region, and the model produces two sets of output tokens, each representing the nucleotide sequence of a germline haplotype. Using a standard transformer-based encoder and double-decoder architecture, our model learns to construct germline haplotypes in a generative fashion identical to modern Large Language Models (LLMs). We train our model on 37 Whole Genome Sequences (WGS) from Genome-in-a-Bottle samples, and demonstrate that our method learns to produce accurate haplotypes with correct phase and genotype for all classes of small variants. We compare our method, called Jenever, to FreeBayes, GATK HaplotypeCaller, Clair3 and DeepVariant, and demonstrate that our method has superior overall accuracy compared to other methods. At F1-maximizing quality thresholds, our model delivers the highest sensitivity, precision, and the fewest genotyping errors for insertion and deletion variants. For single nucleotide variants our model demonstrates the highest sensitivity but at somewhat lower precision, and achieves the highest overall F1 score among all callers we tested.

Keywords: detection; generation sequencing; next generation; sequencing data; model; variant detection

Journal Title: Bioinformatics
Year Published: 2024

Link to full text (if available)


Share on Social Media:                               Sign Up to like & get
recommendations!

Related content

More Information              News              Social Media              Video              Recommended



                Click one of the above tabs to view related content.