Nucleotide diversity (π) remains an important statistic in population genetic and genomic studies. Although recent advances in massive sequencing make generating sequence data sets cheaper and faster, currently used technologies often introduce substantial amounts of missing nucleotides in their output. A novel method of estimating π from data sets containing missing data, pixy, has recently been proposed. In this study, the pixy estimator, πpixy, was compared to average weighted nucleotide diversity, πW. The estimators were tested on both sequences simulated in fastsimcoal and real sequence sets. Both sets were modified by random insertion of missing nucleotides. Weighted nucleotide diversity performed better in all pairwise comparisons: it was characterized by a smaller error and a narrower distribution of the results. πpixy tends to overestimate nucleotide diversity when both the proportion of missing data and the level of variation are low. Of the two estimators, only πW recovered the true nucleotide diversity in some of the simulations. The simple formula for πW allows easy integration of the estimator into packages such as pixy, which would yield more precise estimates of nucleotide diversity, either in sliding windows or for discrete genomic regions.
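The testing setup described above can be illustrated with a minimal Python sketch. It assumes the commonly described pixy-style calculation, in which pairwise differences summed over sites are divided by pairwise comparisons summed over sites, with missing genotypes skipped. The function names, the `N` missing-data symbol, and the toy sequences are hypothetical choices made for illustration and are not taken from the study or from the pixy package. The sketch does not implement πW, whose formula is not given in this abstract.

```python
import random
from itertools import combinations

MISSING = "N"  # assumed symbol for a missing nucleotide


def mask_random(seqs, prop_missing, rng):
    """Replace each nucleotide with MISSING independently, with probability prop_missing."""
    return [
        "".join(MISSING if rng.random() < prop_missing else base for base in seq)
        for seq in seqs
    ]


def pi_ratio_of_sums(seqs):
    """Pixy-style estimate: differences summed over sites / comparisons summed over sites."""
    diffs = 0
    comps = 0
    for column in zip(*seqs):  # iterate over alignment columns (sites)
        called = [base for base in column if base != MISSING]
        for a, b in combinations(called, 2):  # all pairs of non-missing genotypes
            comps += 1
            diffs += a != b
    return diffs / comps if comps else float("nan")


# Toy alignment (hypothetical data): 4 sequences, 8 sites, 2 variable sites.
seqs = ["ACGTACGT", "ACGTACGA", "ACGAACGT", "ACGTACGT"]
pi_full = pi_ratio_of_sums(seqs)

# Randomly insert about 20% missing nucleotides, then re-estimate.
masked = mask_random(seqs, 0.2, random.Random(1))
pi_masked = pi_ratio_of_sums(masked)
```

Repeating the masking step many times and comparing `pi_masked` with `pi_full` gives an error distribution for an estimator, which is the kind of comparison the abstract reports for πpixy and πW.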
               