Rebecca M Rodriguez, Vedbar S Khadka, Mark Menor, Brenda Y Hernandez, Youping Deng.
Abstract
Cancer is one of the leading causes of morbidity and mortality worldwide. Microbial infections account for up to 20% of the total global cancer burden. The human microbiota within each organ system is distinct, and its compositional variation and interactions with the human host are known to have both detrimental and beneficial effects on tumor progression. With the advent of next-generation sequencing (NGS) technologies, NGS data are being used for pathogen detection in cancer. Numerous bioinformatics computational frameworks have been developed to study viral information from host-sequencing data and can be adapted to bacterial studies. This review highlights popular existing computational frameworks that take NGS data as input to decipher microbial composition, and whose output can predict functional compositional differences with clinical relevance for the development of treatment and prevention strategies.
Keywords: Cancer microbiome; Computational frameworks; NGS
Year: 2020 PMID: 33272199 PMCID: PMC7713026 DOI: 10.1186/s12859-020-03831-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Known and suspected microbial association with cancer pathogenesis
| Cancer type | Known microbial associations | Suspected agents | References |
|---|---|---|---|
| Breast: triple-negative, HER2+, ER+ | None | Epstein–Barr virus, human papillomaviruses | [ |
| Prostate: prostate adenocarcinoma | None | Microbial dysbiosis | [ |
| Stomach: stomach adenocarcinoma | Epstein–Barr virus | Microbial dysbiosis | [ |
| Liver: liver and intrahepatic bile duct | Hepatitis viruses, parasitic infections | | [ |
| Cervical: cervical squamous cell and endometrial carcinoma | Human papillomaviruses | | [ |
| Head and neck: oropharyngeal and laryngeal | Epstein–Barr virus, human papillomaviruses | | [ |
| Colon and rectum: colorectal adenocarcinoma | Microbial dysbiosis | Human papillomavirus | [ |
| Kidney: renal cell carcinoma and clear cell carcinoma | None | Hepatitis C virus, Epstein–Barr virus, urinary tract infection-associated pathogens | [ |
| Lung: lung squamous cell and adenocarcinomas | None | Epstein–Barr virus, Molluscum contagiosum virus, microbial dysbiosis | [ |
| Bladder: bladder squamous cell carcinoma | Human papillomavirus, Epstein–Barr virus | | [ |
Common cancer types, listing known and suspected microbial (viral, bacterial, and other) agents associated with cancer pathogenesis or identified as common causes of infection in cancer patients, which may play a role in inter-patient variability.
Fig. 1 Generic pipeline comparing three basic computational frameworks designed to identify microbial reads from human sequences. Generic pipelines can be summarized in three general stages: pre-processing (blue), processing (yellow), and post-processing analyses (green). During pre-processing, most methodologies trim and quality-filter sequencing reads. During processing, quality reads are mapped and aligned to either human or pathogen reference sequences, or to key identifying factors, before a final identification call is made. Once species have been identified, their composition is characterized in detail, depending on the methodology being used. Finally, taxonomic classification and compositional structure permit downstream correlation analyses and identification of functionally relevant molecular pathways. Differential functional prediction and inter-patient variability aid in the identification of novel microbe-based prevention and treatment strategies.
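The three stages in Fig. 1 can be illustrated with a minimal Python sketch. It is not the implementation of any framework reviewed here: exact k-mer matching stands in for a real aligner such as Bowtie2 or BWA, and the sequences, thresholds, and names (`microbe_A`, `microbe_B`, `preprocess`, `process`, `analyse`) are hypothetical and chosen only for illustration.

```python
from collections import Counter

def mean_quality(qual, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    return sum(ord(c) - offset for c in qual) / len(qual)

def preprocess(reads, min_len=20, min_q=20):
    """Stage 1: drop reads that are too short or of low mean quality."""
    return [seq for seq, qual in reads
            if len(seq) >= min_len and mean_quality(qual) >= min_q]

def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_fraction(read, ref_kmers, k):
    """Fraction of the read's k-mers found in a reference (toy 'alignment')."""
    read_kmers = kmers(read, k)
    return len(read_kmers & ref_kmers) / len(read_kmers) if read_kmers else 0.0

def process(reads, host_ref, microbe_refs, k=8, threshold=0.5):
    """Stage 2: subtract host-like reads, then call the best microbial match."""
    host_kmers = kmers(host_ref, k)
    ref_kmers = {name: kmers(seq, k) for name, seq in microbe_refs.items()}
    calls = Counter()
    for read in reads:
        if shared_fraction(read, host_kmers, k) >= threshold:
            continue  # host read: subtracted
        best, score = max(
            ((name, shared_fraction(read, rk, k)) for name, rk in ref_kmers.items()),
            key=lambda pair: pair[1],
        )
        calls[best if score >= threshold else "unassigned"] += 1
    return calls

def analyse(calls):
    """Stage 3: relative abundance among assigned reads."""
    assigned = {n: c for n, c in calls.items() if n != "unassigned"}
    total = sum(assigned.values())
    return {n: c / total for n, c in assigned.items()} if total else {}

# Hypothetical toy references and reads (sequence, quality string).
host = "ACGT" * 12
refs = {"microbe_A": "AAGGCCTT" * 6, "microbe_B": "ATATCGCG" * 6}
good = "I" * 24  # Phred 40
reads = [
    (host[0:24], good),               # host read, subtracted
    (refs["microbe_A"][3:27], good),
    (refs["microbe_A"][5:29], good),
    (refs["microbe_B"][1:25], good),
    ("T" * 24, good),                 # matches nothing: unassigned
    (refs["microbe_A"][0:24], "#" * 24),  # low quality, filtered
    ("ACGTACGT", "I" * 8),            # too short, filtered
]

calls = process(preprocess(reads), host, refs)
abundance = analyse(calls)
print(calls)
print(abundance)
```

In this toy run, reads that fail quality filtering never reach the subtraction step, and reads that match neither the host nor a microbial reference are kept as `unassigned`, which mirrors how several of the frameworks retain unmapped reads for de novo assembly.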
Computational frameworks designed to detect microbiota from human sequences by subtractive, filtration, or mixed methods
| Framework | Approach | Dependencies | Input | Output | Advantages/disadvantages | Cancer validation | Refs. |
|---|---|---|---|---|---|---|---|
| PathSeq | Alignment and de novo assembly | BLAST, BLASTN, BLASTX, MAQ, MegaBLAST, RepeatMasker, Velvet | RNA-seq or DNA-seq | Pathogen presence/absence | Scalable cloud computing; feasible for known and novel pathogen identification; two-pass subtraction with increased filtering costs | Cervical cancer (cell line and simulated data); TCGA ovarian | [ |
| SRSA | Alignment and de novo assembly | Velvet, MegaBLAST, BLAST, BWA, TopHat | RNA-seq | Species-level taxonomy characterization (prevalence) | Incorporates sample pre-processing, quality filtering, sequence mapping, and assembly; not freely available; no known updates; original validation was limited to a cell line | HIV-1 cell line | [ |
| CaPSID | Mixed-method: simultaneous alignment, filtration and de novo assembly | BioPython, Bowtie2, Trinity | RNA-seq or DNA-seq | Top-hit pathogen genome identification ranked by maximum gene coverage | Web-based, open-source and scalable application; modular analyses; single-pass filtering, which may fail to subtract host reads | Ovarian cancer; TCGA stomach | [ |
| SURPI | Dual scanning mode: known pathogen identification or de novo assembly | SNAP, RAPSearch, BWA, BLASTN, Bowtie2, DUST in PRINSEQ | Paired-end metagenomic | Species-level taxonomic classification and coverage map | Scalable to cloud or standalone servers; capacity to incorporate reference database; dual-mode: quantitative and semi-quantitative pathogen identification | Prostate cancer (cell line, tissue biopsies); colorectal cancer (tissue biopsies) | [ |
| PathoScope 2.0 | Penalized probabilistic identification; modular filtration, alignment and assignment | SAMtools, BLASTX, Bowtie2, thetaPrior | Metagenomic or genomic (RNA-seq or DNA-seq) | Strain-level pathogen relative abundance | Modular, detailed result reporting; designed for low-abundance strain-level identification; MySQL server required; no connection to the population structure of relevant species | TCGA stomach | [ |
| VirusScan | Identification of known viruses and integration sites | BWA, BLAST, MegaBLAST, Pindel, RepeatMasker, PHYLIP | RNA-seq | Viral read abundance and integration sites | Designed for viral identification; abundance and integration site analyses | TCGA cancer cohorts | [ |
| MetaShot | Two-step similarity filtering and taxonomic assessment | Bowtie2, TANGO, STAR, Bash | RNA-seq or DNA-seq | Assigned read report and Krona plot with relative abundance | Extracts unassigned reads; allows for functional annotations; slower than other applications | None | [ |
| ConStrains | Marker-based (SNP patterns) strain-level prediction | MetaPhlAn, PhyloPhlAn, Bowtie2, SAMtools, Metropolis–Hastings Monte Carlo | Metagenomics (RNA-seq) | Strain-level prediction and relative abundance | Single reference strain collection; facilitates functional analyses when combined with reference genome-based gene coverage metadata | None | [ |
| RINS | Intersection-based identification and removal | Bowtie, BLAST, BLAT, Trinity | Mate-paired RNA-seq unmapped reads | Pathogen contigs | Requires prior knowledge of reference; detection limited to user-defined parameters | Prostate cancer (cell line) | [ |
| GRAMMy | Mixture-model Bayesian, expectation–maximization and maximum likelihood estimation | BLAST, BLAT, MAQ, Bowtie, PerM, BLASY | Metagenomics reads | Genomic relative abundance as numerical vectors | User flexibility; probabilistic handling of ambiguous hits; computational efficiency | None | [ |
Comparison of computational workflows designed to derive microbial content from human sequences by subtractive and filtering methods, broadly categorized as reference-based, reference-free, and mixed-method approaches. The data required to run each pipeline, its output, and its advantages and disadvantages are summarized. Most have been validated with large cancer datasets, including TCGA sequencing data. ConStrains is reference-free, while all other approaches are reference-based or mixed-method.
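Several frameworks in the table handle reads that map equally well to more than one genome probabilistically, rather than discarding them. The Python sketch below shows the general idea with a plain expectation–maximization loop. It is a simplified illustration under stated assumptions, not the GRAMMy or PathoScope algorithm: it ignores genome length and alignment scores, treats all hits for a read as equally good, and uses hypothetical names (`em_abundance`, `genome_A`, `genome_B`) and made-up read counts.

```python
def em_abundance(read_hits, genomes, n_iter=100, tol=1e-9):
    """Estimate relative abundances from ambiguous read-to-genome hits.

    read_hits: list of sets, each holding the genomes a read maps to
               (assumed equally well; a simplification).
    genomes:   list of all candidate genome names.
    """
    theta = {g: 1.0 / len(genomes) for g in genomes}  # uniform start
    for _ in range(n_iter):
        # E-step: split each read across its hits in proportion to theta.
        counts = {g: 0.0 for g in genomes}
        for hits in read_hits:
            norm = sum(theta[g] for g in hits)
            for g in hits:
                counts[g] += theta[g] / norm
        # M-step: abundances are the normalized fractional read counts.
        new_theta = {g: counts[g] / len(read_hits) for g in genomes}
        delta = max(abs(new_theta[g] - theta[g]) for g in genomes)
        theta = new_theta
        if delta < tol:
            break
    return theta

# Hypothetical data: 6 reads unique to genome_A, 2 unique to genome_B,
# and 4 reads that map to both.
read_hits = (
    [{"genome_A"}] * 6
    + [{"genome_B"}] * 2
    + [{"genome_A", "genome_B"}] * 4
)
theta = em_abundance(read_hits, ["genome_A", "genome_B"])
print(theta)
```

With these made-up counts the estimate settles near 0.75 for `genome_A` and 0.25 for `genome_B`: the ambiguous reads are apportioned in the same 3:1 ratio as the uniquely mapped reads, instead of being split evenly or thrown away.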