Sanket Desai, Sonal Rashmi, Aishwarya Rane, Bhasker Dharavath, Aniket Sawant, Amit Dutt.
Abstract
The analysis of SARS-CoV-2 genome datasets has significantly advanced our understanding of the biology and genomic adaptability of the virus. However, the plurality of advanced sequencing datasets, such as short and long reads, presents a formidable computational challenge to performing quantitative, variant or phylogenetic analysis uniformly, which limits the application of such analysis in public health laboratories engaged in studying epidemic outbreaks. We present a computational tool, Infectious Pathogen Detector (IPD), to perform integrated analysis of diverse genomic datasets, with a customized analytical module for the SARS-CoV-2 virus. The IPD pipeline quantitates individual occurrences of 1060 pathogens and performs mutation and phylogenetic analysis from heterogeneous sequencing datasets. Using IPD, we demonstrate a varying burden (5.055-999655.7 fragments per million) of SARS-CoV-2 transcripts across 1500 short- and long-read sequencing SARS-CoV-2 datasets and identify 4634 SARS-CoV-2 variants (~3.05 variants per sample), including 449 novel variants, across the genome, with distinct hotspot mutations in the ORF1ab and S genes. Together with the phylogenetic relationships of these variants, this establishes the utility of IPD in tracing genome isolates from the genomic data (as accessed on 11 June 2020). IPD predicts the occurrence and dynamics of variability among infectious pathogens, with potential for direct utility in the COVID-19 pandemic and beyond, to help automate sequencing-based pathogen analysis and to respond efficaciously to public health threats. A graphical user interface (GUI)-enabled desktop application is freely available for download for academic users at http://www.actrec.gov.in/pi-webpages/AmitDutt/IPD/IPD.html and for web-based processing at http://ipd.actrec.gov.in/ipdweb/ to generate an automated report without any prior computational know-how.
Keywords: COVID-19 pandemic; computational subtraction; graphical user interface (GUI); infectious pathogen detection; severe acute respiratory syndrome coronavirus 2
Year: 2021 PMID: 33479725 PMCID: PMC7929363 DOI: 10.1093/bib/bbaa437
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Figure 1. Analytical pipeline of IPD. The pipeline is composed of three modules: (A) heterogeneous data filtration/alignment, (B) integrated variant calling and quantification, and (C) specialized SARS-CoV-2 analysis module.
Figure 2. Conceptual algorithm to compute and assign phylogenetic clades based on the variants obtained in a sample, using GISAID-based IPD variation database as reference.
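The caption for Figure 2 describes clade assignment only at a conceptual level, so the following is a minimal sketch of one way such an assignment could work, not IPD's actual algorithm. It assumes that each clade in the reference database is represented by a set of defining variants and that a sample is assigned to the clade with the highest Jaccard overlap. The function name `assign_clade`, the scoring rule and the toy database entries are all hypothetical.

```python
from typing import Dict, Optional, Set, Tuple

# A variant is modelled as (genome position, reference allele, alternate allele).
Variant = Tuple[int, str, str]


def assign_clade(
    sample_variants: Set[Variant],
    clade_db: Dict[str, Set[Variant]],
) -> Tuple[Optional[str], float]:
    """Return the clade whose defining variants best overlap the sample.

    Assumption: overlap is scored with the Jaccard index; the real IPD
    scoring scheme is not given in the source text.
    """
    best_clade: Optional[str] = None
    best_score = 0.0
    # Iterate in sorted order so that ties resolve deterministically.
    for clade, defining in sorted(clade_db.items()):
        if not defining:
            continue
        shared = len(sample_variants & defining)
        union = len(sample_variants | defining)
        score = shared / union
        if score > best_score:
            best_clade, best_score = clade, score
    return best_clade, best_score


# Toy reference database (illustrative entries, not the GISAID definitions).
toy_db = {
    "clade_A": {(241, "C", "T"), (3037, "C", "T"), (23403, "A", "G")},
    "clade_B": {(8782, "C", "T"), (28144, "T", "C")},
}
sample = {(241, "C", "T"), (3037, "C", "T"), (14408, "C", "T"), (23403, "A", "G")}
clade, score = assign_clade(sample, toy_db)
print(clade, score)
```

A sample that shares no variant with any clade is returned as `(None, 0.0)`, which leaves the decision about unassigned samples to the caller.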
Figure 3. Integrated pathogen quantification and variant analysis using IPD. (A) Heatmap representation of the pathogens with a minimum burden of 1 FPM in at least 1% of the samples in both short-read (left panel) and long-read (right panel) data. The sample set consisted of 1440 SARS-CoV-2–positive (1035 short and 405 long read), 16 MERS-positive and 44 SARS-CoV-2–negative samples. In the left panel, bacterial pathogens are summed into a single entity, shown in the plot as ‘Pathogenic bacteria’. A detailed heatmap of bacterial pathogens is provided in Supplementary Figure S3, available online at https://academic.oup.com/bib. (B) IPD-based variant analysis of publicly available SARS-CoV-2–positive sequencing samples. The left panel shows the position-wise mutation count generated by IPD variant analysis of short- (n = 1095) and long-read (n = 405) sequencing data. The right panel shows the mutation distribution of the Snippy-based variant analysis of the GISAID genomes (n = 23376). The hotspot mutation positions (241, 3037, 14408, 23403) are marked in the plot along with the common mutant alleles observed in the IPD-based and Snippy-based analysis, respectively. The X-axis (bottom) shows the overlay of the gene annotations of SARS-CoV-2 (arrows indicate specific genes in the genome).
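The heatmap inclusion rule in panel (A), a burden of at least 1 FPM in at least 1% of samples, can be illustrated with a short sketch. It assumes that FPM (fragments per million) is the number of fragments assigned to a pathogen divided by the total number of fragments in the sample, scaled to one million; the exact denominator used by IPD is not stated here. The function names and the toy table are hypothetical.

```python
from typing import Dict, List


def fragments_per_million(pathogen_fragments: int, total_fragments: int) -> float:
    """Normalize a pathogen fragment count to fragments per million (FPM).

    Assumption: the denominator is the total fragment count of the sample.
    """
    if total_fragments <= 0:
        raise ValueError("total_fragments must be positive")
    return pathogen_fragments * 1_000_000 / total_fragments


def pathogens_for_heatmap(
    fpm_table: Dict[str, List[float]],
    min_fpm: float = 1.0,
    min_fraction: float = 0.01,
) -> List[str]:
    """Keep pathogens with at least `min_fpm` in at least `min_fraction` of samples."""
    kept = []
    for pathogen, values in fpm_table.items():
        if not values:
            continue
        n_pass = sum(1 for v in values if v >= min_fpm)
        if n_pass / len(values) >= min_fraction:
            kept.append(pathogen)
    return kept


# Toy FPM table with 100 samples per pathogen (made-up values).
toy_table = {
    "SARS-CoV-2": [100.0] * 100,       # passes in every sample
    "rare_pathogen": [0.0] * 99 + [2.0],  # passes in exactly 1% of samples
    "background": [0.5] * 100,         # never reaches 1 FPM
}
print(fragments_per_million(5, 1_000_000))
print(pathogens_for_heatmap(toy_table))
```

Both thresholds are parameters, so the same function can be reused if a stricter or looser inclusion rule is wanted for a different cohort.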
Figure 4. IPD benchmarking. (A) F-score plot for short-read quantification for the SARS-CoV-2 (N = 71), F. nucleatum (N = 17) and HPV (N = 88) truth sets, (B) correlation matrix of the normalized qPCR load of F. nucleatum with the quantification using IPD, PathoScope2, Kraken2 and GATK-PathSeq. The negative correlation between the qPCR data and the quantification by all the tools has been multiplied by minus one for representation, (C) F1-score, sensitivity and precision of different variant calling tools/pipelines on the SARS-CoV-2–simulated short-read dataset (N = 36), (D) accuracy of SARS-CoV-2 lineage prediction by the IPD SARS-CoV-2 module, based on the variants derived from short- and long-read samples (N = 53). The X-axis denotes the random background mutation rate introduced in the simulated dataset.
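The metrics in panel (C) follow the standard definitions of precision, sensitivity (recall) and F1-score. The sketch below shows how they can be computed for one simulated sample by comparing a set of called variants with a truth set. Exact matching on (position, reference allele, alternate allele) is an assumption, and the function name and example variants are hypothetical rather than taken from the benchmark.

```python
from typing import Dict, Set, Tuple

# A variant is modelled as (genome position, reference allele, alternate allele).
Variant = Tuple[int, str, str]


def variant_call_metrics(called: Set[Variant], truth: Set[Variant]) -> Dict[str, float]:
    """Compute precision, sensitivity and F1 for a variant call set.

    Assumption: a call counts as a true positive only on an exact match
    of position and alleles.
    """
    tp = len(called & truth)   # called and present in the truth set
    fp = len(called - truth)   # called but absent from the truth set
    fn = len(truth - called)   # in the truth set but not called
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {"precision": precision, "sensitivity": sensitivity, "f1": f1}


# Toy example: three of four true variants recovered, plus one false call.
truth_set = {(241, "C", "T"), (3037, "C", "T"), (14408, "C", "T"), (23403, "A", "G")}
call_set = {(241, "C", "T"), (3037, "C", "T"), (23403, "A", "G"), (100, "A", "C")}
metrics = variant_call_metrics(call_set, truth_set)
print(metrics)
```

Returning 0.0 when a denominator is zero avoids a division error for empty call or truth sets; a benchmark may prefer to report such cases as undefined instead.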
Figure 5. Automated report generation in IPD. HTML report generated by the IPD SARS-CoV-2 module, containing three major sections: sequencing statistics, pathogen quantification, and genome coverage and variant information.