L M Simon, S Karg, A J Westermann, M Engel, A H A Elbehery, B Hense, M Heinig, L Deng, F J Theis.
Abstract
Background: With the advent of the age of big data in bioinformatics, large volumes of data and high-performance computing power enable researchers to perform re-analyses of publicly available datasets at an unprecedented scale. Ever more studies implicate the microbiome in both normal human physiology and a wide range of diseases. RNA sequencing technology (RNA-seq) is commonly used to infer global eukaryotic gene expression patterns under defined conditions, including human disease-related contexts; however, its generic nature also enables the detection of microbial and viral transcripts.

Findings: We developed a bioinformatic pipeline to screen existing human RNA-seq datasets for the presence of microbial and viral reads by re-inspecting the non-human-mapping read fraction. We validated this approach by recapitulating outcomes from six independent, controlled infection experiments of cell line models and compared them with an alternative metatranscriptomic mapping strategy. We then applied the pipeline to close to 150 terabytes of publicly available raw RNA-seq data from more than 17,000 samples from more than 400 studies relevant to human disease using state-of-the-art high-performance computing systems. The resulting data from this large-scale re-analysis are made available in the presented MetaMap resource.

Conclusions: Our results demonstrate that common human RNA-seq data, including those archived in public repositories, might contain valuable information to correlate microbial and viral detection patterns with diverse diseases. The presented MetaMap database thus provides a rich resource for hypothesis generation toward the role of the microbiome in human disease. Additionally, code to process new datasets and perform statistical analyses is made available.
Year: 2018 PMID: 29901703 PMCID: PMC6025204 DOI: 10.1093/gigascience/giy070
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Figure 1: Schematic of the MetaMap pipeline. More than 400 projects from studies relevant to human disease were identified in the SRA database. More than 500 billion RNA-seq reads were downloaded and first filtered by mapping them onto the human genome. The remaining reads underwent metafeature classification. Overall, 90.7% of all reads mapped to the human genome; 0.03%, 0.20%, and 0.39% of all reads were assigned to archaeal, bacterial, or viral metafeatures, respectively; and 8.6% of all reads remained nondiscriminative at the species level (“unclassified”).
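The read fractions in the Figure 1 caption can be turned into rough absolute numbers. The sketch below is only an illustration: the percentages are copied from the caption, while the total of exactly 500 billion reads is an assumption (the caption says "more than 500 billion"), so every count is a lower bound. The names `READ_FRACTIONS_PCT`, `TOTAL_READS_LOWER_BOUND`, and `reads_for` are hypothetical.

```python
from decimal import Decimal

# Percentages of all reads, copied from the Figure 1 caption.
READ_FRACTIONS_PCT = {
    "human": Decimal("90.7"),
    "archaeal": Decimal("0.03"),
    "bacterial": Decimal("0.20"),
    "viral": Decimal("0.39"),
    "unclassified": Decimal("8.6"),
}

# Assumption: the caption only states "more than 500 billion" reads,
# so this value is a lower bound, not the exact total.
TOTAL_READS_LOWER_BOUND = 500_000_000_000


def reads_for(category: str) -> int:
    """Lower-bound read count implied by a category's percentage."""
    pct = READ_FRACTIONS_PCT[category]
    return int(pct / Decimal(100) * TOTAL_READS_LOWER_BOUND)


# Share of reads covered by the five reported categories.
accounted_pct = sum(READ_FRACTIONS_PCT.values())

# Share of reads assigned to archaeal, bacterial, or viral metafeatures.
microbial_pct = sum(
    READ_FRACTIONS_PCT[name] for name in ("archaeal", "bacterial", "viral")
)

if __name__ == "__main__":
    print(f"accounted for: {accounted_pct}% of all reads")
    print(f"microbial or viral: {microbial_pct}% of all reads")
    for name in READ_FRACTIONS_PCT:
        print(f"{name}: at least {reads_for(name):,} reads")
```

Under these assumptions the five categories cover 99.92% of reads, and the small remainder is most plausibly rounding in the reported percentages.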
Overview of four dual RNA-seq studies used to validate the MetaMap pipeline.
| Study | Infection agent | Total reads | Alphapapillomavirus 9 | Salmonella enterica | Human alphaherpesvirus 1 | Rhinovirus A |
|---|---|---|---|---|---|---|
| Westermann et al. [27] | Salmonella enterica | 1.0e+07 | 1.2e-01 | | 1.5e-01 | 1.2e-01 |
| Zhang et al. [28] | Human papillomavirus | 4.6e+07 | | 3.0e-02 | 2.2e-02 | 2.2e-02 |
| Rutkowski et al. [29] | Herpes simplex virus | 3.5e+07 | 1.1e+00 | 3.1e-02 | | 3.0e-02 |
| Bai et al. [30] | Rhinovirus | 6.6e+06 | 2.0e-01 | 1.5e-01 | 1.5e-01 | |
The Total reads column shows the average read depth per sample for each study. Average metafeature abundances for alphapapillomavirus 9, Salmonella enterica, human alphaherpesvirus 1, and rhinovirus A are shown in reads per million. The correct infection agent for the respective study is highlighted in bold font.
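To make the reads-per-million unit concrete, here is a minimal sketch of the normalization and its inverse. It assumes the denominator is the total number of reads in a sample; the source does not state which denominator MetaMap uses, and the function names are hypothetical.

```python
def reads_per_million(metafeature_reads: float, total_reads: float) -> float:
    """Scale a raw read count to reads per million (RPM).

    Assumption: total_reads is the full read depth of the sample.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return metafeature_reads / total_reads * 1_000_000


def reads_from_rpm(rpm: float, total_reads: float) -> float:
    """Invert the normalization: average raw reads implied by an RPM value."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return rpm * total_reads / 1_000_000


if __name__ == "__main__":
    # Values taken from the Westermann et al. row: an average depth of
    # 1.0e+07 reads and an alphapapillomavirus 9 abundance of 1.2e-01 RPM.
    implied = reads_from_rpm(1.2e-01, 1.0e07)
    print(f"about {implied:.1f} reads per sample on average")
```

Read this way, an off-target abundance of 1.2e-01 RPM at a depth of 1.0e+07 reads corresponds to roughly one read per sample.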
Figure 2: Differential metafeature abundance analysis of controlled infection experiments recovers ground truth. “Volcano” plots show fold change and inverted P value on the x and y axes, respectively. Each dot represents a metafeature. The most significant metafeature is colored in red. Insets display box plots of the abundance levels in reads per million of the top hit metafeature across conditions for each study. For all box plots, the box represents the interquartile range, the horizontal line in the box is the median, and the whiskers represent 1.5 times the interquartile range.
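The volcano plot coordinates described in the Figure 2 caption can be sketched as follows. This is an illustration under several assumptions, not the authors' analysis code: "inverted P value" is read as -log10(P), fold change is taken on a log2 scale with a pseudocount to avoid division by zero, and all names and example values are hypothetical.

```python
import math


def volcano_point(mean_case_rpm: float, mean_control_rpm: float,
                  p_value: float, pseudocount: float = 1.0):
    """Return (log2 fold change, -log10 P) for one metafeature.

    Assumptions: log2 scale for the x axis, -log10 for the y axis,
    and a pseudocount added to both group means.
    """
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must lie in (0, 1]")
    log2_fc = math.log2(
        (mean_case_rpm + pseudocount) / (mean_control_rpm + pseudocount)
    )
    return log2_fc, -math.log10(p_value)


def top_hit(points: dict) -> str:
    """Name of the most significant metafeature (largest -log10 P)."""
    return max(points, key=lambda name: points[name][1])


if __name__ == "__main__":
    # Hypothetical inputs for illustration only; not data from the study.
    points = {
        "metafeature_a": volcano_point(7.0, 1.0, 0.001),
        "metafeature_b": volcano_point(1.2, 1.0, 0.4),
    }
    print(top_hit(points), points)
```

The metafeature returned by `top_hit` corresponds to the dot that the caption describes as colored in red.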
Figure 3: Analysis of lymphoblast cell line experiments further supports the MetaMap pipeline. (A and B) Mean abundance levels across all samples of the top five metafeatures for projects SRP041338 and SRP091453, respectively. (C) Relative proportion of reads mapping to EBV, phiX, and all other metafeatures across RNA-seq samples. (D) Cumulative distribution plot of the average proportion of bacterial metafeature reads across all projects. Purple and pink vertical lines highlight projects SRP041338 and SRP091453, respectively.
Figure 4: Alternative BLAST-based classification method validates metafeature abundance estimates by MetaMap. (A) Average metafeature reads per million levels derived using the CLARK-S software, as implemented in the MetaMap pipeline, and a BLAST-based alternative approach on the x and y axes, respectively. (B) Correlation in S. enterica abundance levels between the two classification approaches. (C) Difference in classification speed between the BLAST and CLARK-S metatranscriptomic classification. The y axis shows the number of reads processed per hour per thread in log10 space.