Although we are reflecting on machine ethics, learning how to live with machines and realizing how easy it is to fool deep neural networks, we had better not forget the more mundane problems associated with the exponentially accumulating genomics-derived Big Data in public repositories. These data should be reliable, reproducible and directly available for comparative analyses, thus avoiding multiplication of data sets for the same experiment. Given that human experimentation is involved, and in accordance with good scientific practice, we should expect Big Data to automatically carry a quality assessment (QA) tag in the databases, marking the GOOD, the BAD and the UGLY ones. Even though many of us have come across BAD and UGLY data sets, surprisingly, there is no indication of data quality in the Gene Expression Omnibus (GEO), the European Nucleotide Archive or the Sequence Read Archive. What is worse, there is no commonly agreed procedure for assigning quality qualifiers to large-scale next-generation sequencing (NGS) data sets. At the International Journal of Cancer (IJC), we receive a large number of manuscripts with NGS data, but the majority do not contain information about data quality. Often, the data sets have not been deposited, and sequencing statistics and information on replicates are missing. Consequently, sequencing data are a “blind spot” in a manuscript for editors and reviewers. From acting as reviewers for other journals, our Scientific Editors know that the situation is no different at what we generally consider “top-level journals.” This blind spot is particularly worrying for chromatin immunoprecipitation (ChIP)-seq and other enrichment-based sequencing technologies, which are prone to many potential artifacts. Indeed, a study in 2014 reported that about 20% of ChIP-seq data sets were of poor, and 25% of intermediate, quality. This is not due to the absence of QA procedures.
In fact, several recommendations have been proposed but are rarely followed, leading to the accumulation of poor-quality data sets in public repositories, which themselves have no stringent data quality management. Although ChIP-seq has been recognized as particularly vulnerable to changes in experimental conditions, it is not the only technology of concern, as an increasing number of genomics technologies are being developed and used. These include, but are not limited to, whole genome sequencing/whole exome sequencing, whole genome bisulfite sequencing/reduced representation bisulfite sequencing (RRBS), RNA-seq, RNA immunoprecipitation-seq, assay for transposase-accessible chromatin sequencing (ATAC-seq), (capture) high-throughput chromatin conformation and ribosome profiling, along with their various modifications, and new technologies such as spatial transcriptomics are continuously being developed. It is essential that the fields of molecular, systems and computational biology, and oncology develop an awareness of the need for high-quality standards. This includes, but goes beyond, the classical standard of providing true biological replicates, which applies to omics experiments as well. The increasingly widespread use of single-cell (sc) omics, such as scDNA-seq, scBS-seq/scRRBS, scRNA-seq and scATAC-seq, presents another challenge. Clearly, the amount of nucleic acids present in a single cell sets a natural limit on sequencing depth, which is an important measure of representation when questions such as clonal variation and cell fate trajectories are addressed. Therefore, metrics such as the total number of cells called and the number of uniquely mapped reads per cell are essential for assessing the quality and representational value of a single-cell omics data set. Unfortunately, the scientific community has not yet established guidelines for the quality assessment of single-cell omics data. Big Data quality is an important issue for reviewers of submitted manuscripts, who must judge the validity of the authors' statements.
Moreover, mere submission to a public repository does not ensure high quality, as data sets can be submitted without a quality check. In fact, and somewhat surprisingly, it is not even necessary to attach a quality assessment document to the data files to obtain an accession number. Except for specialists in the field, it is virtually impossible for an average reviewer to rapidly access and quality check data sets that have been deposited in a public database such as GEO, the European Genome-phenome Archive or the database of Genotypes and Phenotypes. Please note that, for this journal, it is imperative that data sets are deposited and accessible to reviewers (potentially password protected until acceptance) at the time of manuscript submission. To facilitate the reviewers' tasks and to introduce a quality check for Big Data, IJC requests that studies involving original omics-based NGS data include information about the sequencing coverage and quality statistics of the generated data. The corresponding instructions are accessible in our Author Guidelines [https://onlinelibrary.wiley.com/pb-assets/assets/10970215/IJC_Sequencing_Coverage_and_Quality_Statistics_Guidelines-1607431877843.pdf]. This information will help authors to check their data and facilitate the tasks of our reviewers. It is the specific ambition of this journal to ensure that the conclusions reported by our authors are based exclusively on high-quality and reliable Big Data. It is our hope that other journals will join this initiative, to the benefit of the quality of science and public trust in the reported results.
               