Copy number variation (CNV) is a type of genomic variation that plays an important role in phenotypic diversity, evolution, and disease susceptibility. Next-generation sequencing (NGS) technologies have created an opportunity to detect CNVs more accurately and at higher resolution. However, efficient and precise detection of CNVs remains challenging because of high levels of noise and bias, data heterogeneity, and the “big data” nature of NGS data. Sequence coverage (readcount) data are the most commonly used input for detecting CNVs, especially in whole-exome sequencing data. Readcount data are contaminated with several types of bias and noise that hinder accurate CNV detection. In this work, we introduce a novel preprocessing pipeline that reduces noise and bias to improve the accuracy of CNV detection in heterogeneous NGS data, such as cancer whole-exome sequencing data. We employ several normalization methods to reduce readcount biases arising from the GC content of reads, read alignment problems, and sample impurity. We also develop a novel, efficient smoothing approach based on the Taut String method to reduce noise and increase CNV detection power. Using simulated and real data, we show that the proposed preprocessing pipeline significantly improves the accuracy of CNV detection.
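
To make the GC-content normalization step concrete, the following is a minimal sketch of one common way to correct GC bias in readcount data: scaling each region's readcount by the ratio of the overall median to the median of regions with similar GC content. This is an illustration under stated assumptions and not necessarily the normalization used in this work; the function name `gc_normalize`, the `bin_width` parameter, and the input layout (one readcount and one GC fraction per region) are hypothetical.

```python
from statistics import median


def gc_normalize(readcounts, gc_fractions, bin_width=0.05):
    """Median-based GC correction (illustrative sketch, not the paper's method).

    Each region's readcount is multiplied by
    (overall median readcount) / (median readcount of its GC bin).
    """
    if len(readcounts) != len(gc_fractions):
        raise ValueError("readcounts and gc_fractions must have equal length")
    if not readcounts:
        return []

    overall = median(readcounts)

    # Group readcounts by GC bin (bin index = floor(gc / bin_width)).
    bins = {}
    for rc, gc in zip(readcounts, gc_fractions):
        bins.setdefault(int(gc / bin_width), []).append(rc)
    bin_medians = {b: median(values) for b, values in bins.items()}

    corrected = []
    for rc, gc in zip(readcounts, gc_fractions):
        m = bin_medians[int(gc / bin_width)]
        # Leave regions in a zero-median bin at 0.0 to avoid dividing by zero.
        corrected.append(rc * overall / m if m > 0 else 0.0)
    return corrected


# Hypothetical toy input: two low-GC regions and two high-GC regions.
counts = [100, 120, 50, 60]
gc = [0.32, 0.33, 0.62, 0.63]
corrected = gc_normalize(counts, gc)
```

In this toy input the high-GC regions have systematically lower coverage; after correction, the first region of each GC group ends up at the same scaled value, so the remaining differences within a group are no longer confounded with GC content.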