Meifang Qi, Utthara Nayar, Leif S Ludwig, Nikhil Wagle, Esther Rheinbay.
Abstract
BACKGROUND: Exogenous cDNA introduced into an experimental system, either intentionally or accidentally, can appear as added read coverage over that gene in next-generation sequencing libraries derived from this system. If not properly recognized and managed, this cross-contamination with exogenous signal can lead to incorrect interpretation of research results. Yet this problem is not routinely addressed in current sequence processing pipelines.
Keywords: Contamination; Genomics; Quality control; Software; cDNA
Year: 2021 PMID: 34952565 PMCID: PMC8709999 DOI: 10.1186/s12859-021-04529-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Workflow of cDNA-detector. (A) Schematic illustrating the source of intentional or contaminant cDNA reads in sequencing experiments. Vectors (grey) with cDNA (red) are amplified in the library preparation process and sequenced together with the experiment. Upon alignment of reads, cDNA-derived reads (red boxes) map to the respective gene locus in the genome, along with true signal reads (blue). Textured red read segments indicate sequence that does not map to the genome and is “clipped” by the alignment algorithm. (B) Overview of the two main components of the cDNA-detector algorithm, “detect” and “decontaminate”. (C) ATAC-seq experiment from TCGA before (pink) and after (green) removal of vector-introduced KRAS. For comparison, an uncontaminated sample (grey) is shown. Boxes indicate contaminant signal over exons.
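The clipping signature described in Fig. 1A can be illustrated with a minimal Python sketch. It is not the cDNA-detector implementation: the function names, the coordinate convention (1-based, inclusive, as in SAM), and the `tolerance` parameter are all assumptions made for illustration. The idea is only that a read derived from spliced cDNA tends to be soft-clipped exactly where an exon ends, because the rest of the read continues into the next exon or into vector sequence.

```python
import re

# Matches one CIGAR operation, e.g. "80M" or "20S".
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a SAM CIGAR string into (length, operation) tuples."""
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]


def clipped_at_exon_boundary(pos, cigar, exon_start, exon_end, tolerance=2):
    """Return True if a read is soft-clipped at either end of an exon.

    pos        -- 1-based leftmost aligned position (SAM POS field)
    cigar      -- SAM CIGAR string
    exon_start -- 1-based inclusive exon start (hypothetical annotation)
    exon_end   -- 1-based inclusive exon end (hypothetical annotation)
    tolerance  -- allowed offset in bp between clip site and exon edge
                  (an illustrative choice, not a published threshold)
    """
    # Hard clips carry no sequence, so ignore them when looking for soft clips.
    ops = [(n, op) for n, op in parse_cigar(cigar) if op != "H"]
    if not ops:
        return False

    # Operations that consume reference bases: M, D, N, =, X.
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    aln_end = pos + ref_len - 1

    left_clip = ops[0][1] == "S"
    right_clip = ops[-1][1] == "S"

    at_left = left_clip and abs(pos - exon_start) <= tolerance
    at_right = right_clip and abs(aln_end - exon_end) <= tolerance
    return at_left or at_right
```

In this sketch, a read with `pos=100` and CIGAR `20S80M` would be flagged for an exon spanning 100 to 250, while a fully aligned `100M` read inside the same exon would not. A real detector would also need to aggregate evidence across many reads and exons before calling a gene contaminated.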
Fig. 2 Performance of cDNA-detector. (A) Recall, precision and F1 score for cDNA-detector and related tools for detection of spiked-in contaminant reads (150 bp paired-end reads) in a simulated data set. Error bars indicate standard errors. Sample size n = 10 for each experiment. (B) Recall and precision of cDNA-detector and Vecuum on single-end (left) and paired-end (right) sequencing experiments, depending on read coverage. Error bars indicate standard errors. Sample size n = 10 for each library type and read length.
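For readers unfamiliar with the metrics in Fig. 2, the following Python sketch shows the standard definitions of precision, recall and F1. The function name and the use of ID sets are assumptions for illustration. The record does not state whether the authors scored individual reads, exons or genes, so the sketch treats each scored item simply as an identifier.

```python
def precision_recall_f1(true_ids, called_ids):
    """Compute precision, recall and F1 from two collections of identifiers.

    true_ids   -- items that really are contaminant (e.g. spiked-in items)
    called_ids -- items a tool reported as contaminant
    """
    true_ids, called_ids = set(true_ids), set(called_ids)

    tp = len(true_ids & called_ids)   # correctly reported
    fp = len(called_ids - true_ids)   # reported but not contaminant
    fn = len(true_ids - called_ids)   # contaminant but missed

    # Guard against division by zero when nothing is called or nothing is true.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1
```

With four true items and three calls, two of which are correct, this gives a precision of 2/3, a recall of 1/2 and an F1 of 4/7.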
Fig. 3 cDNA contamination in published datasets. (A) Examples of ATAC-seq signal from two primary tumor samples from the TCGA [5] showing contamination with DDX58 (encoding the antiviral innate immune response receptor RIG-I; left) or the cohesin component STAG2 (right). True signal would be expected at the promoter and potential intragenic regulatory elements, but not over all exons. Arrowheads indicate spurious signal peak calls caused by contaminant reads over exons (black boxes in gene track; peak calls obtained from ref [5]). (B) cDNA contamination with PPARG in a FOXK2 ChIP-seq experiment in HEK293T cells and PAX7 in an EZH2 ChIP-seq experiment in HUVEC cells from the ENCODE project. Arrowheads indicate official ENCODE peak calls due to contaminant signal over exons. (C) Examples of cDNA contamination with prostate cancer genes FOXO1 and SPOP in an androgen receptor (AR) ChIP-seq experiment performed in the prostate cancer cell line C2-4 [36, 37]. (D) Example of transduced human NRAS cDNA in a mouse ATAC-seq experiment in cell line ICC2.7.