Cuncong Zhong, Anna Edlund, Youngik Yang, Jeffrey S McLean, Shibu Yooseph.
Abstract
Analyses of metagenome data (MG) and metatranscriptome data (MT) are often challenged by a paucity of complete reference genome sequences and the uneven/low sequencing depth of the constituent organisms in the microbial community, which respectively limit the power of reference-based alignment and de novo sequence assembly. These limitations make accurate protein family classification and abundance estimation challenging, which in turn hampers downstream analyses such as abundance profiling of metabolic pathways, identification of differentially encoded/expressed genes, and de novo reconstruction of complete gene and protein sequences from the protein family of interest. The profile hidden Markov model (HMM) framework enables the construction of very useful probabilistic models for protein families that allow for accurate modeling of position-specific matches, insertions, and deletions. We present a novel homology detection algorithm that integrates the banded Viterbi algorithm for profile HMM parsing with an iterative simultaneous alignment and assembly computational framework. The algorithm searches a given profile HMM of a protein family against a database of fragmentary MG/MT sequencing data and simultaneously assembles complete or near-complete gene and protein sequences of the protein family. The resulting program, HMM-GRASPx, demonstrates superior performance in aligning and assembling homologs when benchmarked on both simulated marine MG and real human saliva MG datasets. On real supragingival plaque and stool MG datasets that were generated from healthy individuals, HMM-GRASPx accurately estimates the abundances of the antimicrobial resistance (AMR) gene families and enables accurate characterization of the resistome profiles of these microbial communities. For real human oral microbiome MT datasets, using the HMM-GRASPx-estimated transcript abundances significantly improves detection of differentially expressed (DE) genes.
Finally, HMM-GRASPx was used to reconstruct comprehensive sets of complete or near-complete protein and nucleotide sequences for the query protein families. HMM-GRASPx is freely available online from http://sourceforge.net/projects/hmm-graspx.
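The banded Viterbi idea named in the abstract can be illustrated with a toy sketch. This is not the HMM-GRASPx implementation: the function name `banded_viterbi`, the use of a consensus string in place of a real profile HMM, and the fixed match, mismatch, and gap scores are all assumptions made for illustration. A real profile HMM carries position-specific emission and transition probabilities.

```python
NEG_INF = float("-inf")


def banded_viterbi(seq, consensus, band, match=2.0, mismatch=-1.0,
                   gap_open=-3.0, gap_extend=-1.0):
    """Best global path score of seq through a toy match/insert/delete model.

    Only cells with |i - j| <= band are evaluated; all others stay at -inf.
    All scores are hypothetical stand-ins for log-odds values.
    """
    n, length = len(seq), len(consensus)
    # M, I, D: best score of a path ending in a match, insert, or delete state.
    M = [[NEG_INF] * (length + 1) for _ in range(n + 1)]
    I = [[NEG_INF] * (length + 1) for _ in range(n + 1)]
    D = [[NEG_INF] * (length + 1) for _ in range(n + 1)]
    M[0][0] = 0.0  # begin state

    for i in range(n + 1):
        lo = max(0, i - band)
        hi = min(length, i + band)
        for j in range(lo, hi + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                # Match state: consume residue i and model column j.
                emit = match if seq[i - 1] == consensus[j - 1] else mismatch
                M[i][j] = emit + max(M[i - 1][j - 1], I[i - 1][j - 1],
                                     D[i - 1][j - 1])
            if i > 0:
                # Insert state: consume residue i, stay on column j.
                I[i][j] = max(M[i - 1][j] + gap_open, I[i - 1][j] + gap_extend)
            if j > 0:
                # Delete state: skip column j without consuming a residue.
                D[i][j] = max(M[i][j - 1] + gap_open, D[i][j - 1] + gap_extend)

    return max(M[n][length], I[n][length], D[n][length])


if __name__ == "__main__":
    print(banded_viterbi("ACD", "ACD", band=0))
    print(banded_viterbi("AD", "ACD", band=1))
```

The band limits the number of evaluated cells to roughly `(2 * band + 1) * n` instead of `n * length`. If the band is too narrow for the length difference between the sequence and the model, the end cell is never reached and the score stays at negative infinity.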
Year: 2016 PMID: 27400380 PMCID: PMC4939949 DOI: 10.1371/journal.pcbi.1004991
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. The HMM-GRASPx pipeline for homology search and gene-centric assembly.
Blue shading of the objects indicates that the corresponding data or operation is in nucleotide space, while purple shading indicates amino-acid space.
Performances of the programs for searching metabolic protein family profiles against the simulated marine data set with uneven coverage.
| Pathway | #Families | HMM-GRASPx Rec. | HMM-GRASPx Prec. | HMM-GRASPx F. | HMMER3 Rec. | HMMER3 Prec. | HMMER3 F. | RPS-BLAST Rec. | RPS-BLAST Prec. | RPS-BLAST F. | UProC Rec. | UProC Prec. | UProC F. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KO00010 | 47 |  | 94.3 |  | 25.0 | 95.7 | 39.7 | 43.8 |  | 60.4 | 26.8 | 68.0 | 38.5 |
| KO00020 | 69 |  | 86.5 |  | 18.0 | 92.2 | 30.1 | 31.4 |  | 47.2 | 28.7 | 70.1 | 40.7 |
| KO00030 | 21 |  | 96.2 |  | 28.6 |  | 44.4 | 49.2 | 97.7 | 65.5 | 33.1 | 56.4 | 41.7 |
| KO00051 | 111 |  | 87.4 |  | 15.6 |  | 26.7 | 24.4 | 90.9 | 38.4 | 17.8 | 57.1 | 27.1 |
| KO00620 | 80 |  | 89.4 |  | 15.3 | 92.9 | 26.3 | 27.1 |  | 42.2 | 23.6 | 75.6 | 35.9 |
| KO00680 | 124 |  | 89.7 |  | 17.2 |  | 29.1 | 27.6 | 92.8 | 42.5 | 21.0 | 73.3 | 32.7 |
| KO00910 | 49 |  | 87.3 |  | 16.3 |  | 27.7 | 29.5 | 88.8 | 44.2 | 26.7 | 76.9 | 39.6 |
| KO00920 | 7 |  | 93.7 |  | 11.1 | 93.4 | 19.8 | 24.3 |  | 38.9 | 8.3 | 25.1 | 12.4 |
|  | - |  | 90.6 |  | 18.4 |  | 30.5 | 32.2 |  | 47.4 | 23.2 | 62.8 | 33.6 |
Pathway names: KO00010 (Glycolysis/Gluconeogenesis), KO00020 (TCA cycle), KO00030 (Pentose phosphate pathway), KO00051 (Fructose and mannose metabolism), KO00620 (Pyruvate metabolism), KO00680 (Methane metabolism), KO00910 (Nitrogen metabolism), KO00920 (Sulfur metabolism). The column “#Families” indicates the number of protein (domain) families involved in the corresponding pathway. The columns “Rec.”, “Prec.”, and “F.” indicate Recall, Precision, and F-measure, respectively. All performances are presented as percentages. The highest performances among all programs are bolded.
Running time for all programs on the simulated marine data set is available in S4 Table.
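The relationship between the Recall, Precision, and F-measure columns can be sketched as follows, assuming the standard F1 definition (the harmonic mean of recall and precision). The source does not spell out its formula, so this is an assumption, and the function name `f_measure` is hypothetical. The example inputs are illustrative percentages.

```python
def f_measure(recall, precision):
    """Harmonic mean of recall and precision (standard F1, assumed here).

    Inputs and output share the same scale, e.g. percentages.
    """
    if recall + precision == 0:
        return 0.0  # avoid division by zero when nothing is recovered
    return 2.0 * recall * precision / (recall + precision)


if __name__ == "__main__":
    # A low recall pulls the F-measure down even when precision is high.
    print(round(f_measure(18.0, 92.2), 1))
```

Because the harmonic mean is dominated by the smaller of the two inputs, a program with high precision but low recall still receives a low F-measure.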
Performances of the programs for searching biosynthetic protein family profiles against the human saliva data set SRS013942.
| Name | #Pfams | HMM-GRASPx | RPS-BLAST | UProC | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| #t.r. | r.P. | #t.c. | c.P. | #t.r. | r.P. | #t.c. | c.P. | #t.r. | r.P. | #t.c. | c.P. | ||
| Bacteriocin | 13 | 29 | 182 | 69.2 | 49.3 | 56 | 0.2 | 17 | 0.1 | ||||
| B. Lactone | 1 | 6 | 2 | 13.0 | 16.0 | 2 | 0.0 | 1 | 0.0 | ||||
| H. Lactone | 1 | 9 | 1 | 10.3 | 11.4 | 0 | 0.0 | 0 | 0.0 | ||||
| Lanti pep. | 6 | 21 | 46 | 82.1 | 73.3 | 36 | 0.1 | 13 | 0.0 | ||||
| NRPS | 3 | 479 | 86.5 | 6,125 | 98.7 | 13,786 | 32.8 | 475 | 2.4 | ||||
| Oligo sac. | 3 | 1,315 | 15,463 | 94.4 | 27,232 | 48.7 | 1,372 | 7.6 | |||||
| PKS | 1 | 48 | 23.8 | 10 | 18.5 | 0 | 0.0 | 0 | 0.0 | ||||
| Terpene | 3 | 411 | 94.0 | 92 | 86.8 | 19 | 0.1 | 7 | 0.0 | ||||
| Thiopeptide | 1 | 19 | 333 | 74.0 | 47.1 | 350 | 0.8 | 17 | 0.0 | ||||
“#Pfams” indicates the number of Pfam families involved in the biosynthesis of the corresponding secondary metabolite. Abbreviations: “B. Lactone”: Butyrolactones; “H. Lactone”: Homoserine lactone; “Lanti pep.”: Lantipeptides; “Oligo sac.”: Oligosaccharide. “#t.r.”: number of true reads; “r.P.”: read-level precision (percentage); “#t.c.”: number of true contigs; “c.P.”: contig-level precision (percentage). The highest performances among all programs are bolded.
Fig 2. AMR protein-family profiles from the six supragingival and six stool samples as predicted by HMM-GRASPx and HMMER3.
(A) AMR profile predicted by HMM-GRASPx. (B) AMR profile predicted by HMMER3. Hierarchical clustering was performed sample-wise (column-wise). Color bars on the left-hand sides of the heat maps indicate AMR classification by RESFAM (bottom right legend). Abundance values (RPKM) were row-wise normalized into Z-scores (bottom right color key).
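The RPKM and row-wise Z-score steps mentioned in the caption can be sketched as below. This is a minimal illustration, not the authors' code: the function names, the `length_bp` parameter, and the choice of the sample standard deviation are assumptions, since the source does not state how family length or the deviation were defined.

```python
import statistics


def rpkm(read_count, length_bp, total_reads):
    """Reads per kilobase per million mapped reads.

    length_bp is a hypothetical feature length in base pairs.
    """
    return read_count * 1e9 / (length_bp * total_reads)


def row_zscores(values):
    """Z-scores of one heat-map row (one protein family across samples).

    Uses the sample standard deviation (an assumption); a constant row
    maps to all zeros.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values) if len(values) > 1 else 0.0
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


if __name__ == "__main__":
    abundances = [rpkm(c, 1000, 1_000_000) for c in (100, 200, 300)]
    print(abundances)
    print(row_zscores(abundances))
```

Row-wise scaling makes families with very different absolute abundances comparable in one heat map, at the cost of hiding those absolute differences.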
Performance of HMM-GRASPx and HMMER3 for searching biosynthetic protein family profiles against the in vitro human oral plaque biofilm MT data sets.
| Sample | #Reads | HMM-GRASPx Recall | HMM-GRASPx Precision | HMM-GRASPx F-measure | HMMER3 Recall | HMMER3 Precision | HMMER3 F-measure |
|---|---|---|---|---|---|---|---|
|  | 22,280,139 |  | 83.0 |  | 15.8 |  | 26.9 |
|  | 17,687,166 |  | 88.4 |  | 16.4 |  | 27.8 |
|  | 23,106,973 |  | 88.0 |  | 16.0 |  | 27.3 |
|  | 31,229,289 |  | 74.2 |  | 16.2 |  | 27.5 |
|  | 43,702,127 |  | 78.5 |  | 16.7 |  | 28.4 |
|  | 15,550,857 |  | 84.1 |  | 16.7 |  | 28.2 |
|  | 37,336,599 |  | 86.8 |  | 14.5 |  | 25.1 |
|  | 42,394,221 |  | 77.3 |  | 14.5 |  | 25.1 |
Fig 3. Comparison of HMM-GRASPx- and HMMER3-generated results from the in vitro human oral plaque biofilm MT data set.
(A) Abundance correlation between HMM-GRASPx predictions and the ground truth (i.e. BWA). The x- and y-axes indicate the number of homologous reads recruited by the corresponding programs. Colors indicate the conditions under which the libraries were constructed (blue for 0hr/pH 7.0, red for 6hr/pH 4.2, and yellow for 9hr/pH 5.2). Different markers indicate different replicates. (B) Abundance correlation between HMMER3 predictions and the ground truth. (C) Venn diagram showing the overlap between DE genes detected using HMM-GRASPx predictions (green), HMMER3 predictions (red), and the ground-truth homologous reads (blue).
Fig 4. Targeted assembly of secondary metabolite synthesizing protein families from the in vitro human oral biofilm MT data set.
Only protein contigs longer than 60 aa and nucleotide contigs longer than 180 nt were considered. (A) Normalized N50 for protein assembly. (B) Normalized N50 for nucleotide assembly. For (A) and (B), red indicates the performance of HMMER3 and blue indicates HMM-GRASPx. The x-axes indicate assembly cases, sorted first by decreasing HMMER3 performance and then by decreasing HMM-GRASPx performance. Assembly cases without corresponding red bars indicate that no contig was assembled using HMMER3 predictions. (C) Log2 fold change for the N50 measures in assembly cases where contigs can be assembled using either HMMER3 or HMM-GRASPx predictions. (D) Log2 fold change for the number of assembled reads in assembly cases where contigs can be assembled using either HMMER3 or HMM-GRASPx predictions.
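The N50 and log2 fold change measures named in the caption can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the normalization applied in panels (A) and (B) is not described here and is not reproduced, and which program serves as the numerator of the fold change is also an assumption.

```python
import math


def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half
    of the total assembled length (plain, unnormalized N50)."""
    if not contig_lengths:
        raise ValueError("no contigs to summarize")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def log2_fold_change(value_a, value_b):
    """log2(value_a / value_b); positive when value_a is larger."""
    return math.log2(value_a / value_b)


if __name__ == "__main__":
    # Hypothetical contig lengths for two assemblies of the same family.
    assembly_a = [200, 300, 400, 500, 600]
    assembly_b = [100, 150, 250]
    print(n50(assembly_a), n50(assembly_b))
    print(log2_fold_change(n50(assembly_a), n50(assembly_b)))
```

A fold change is only defined when both assemblies produced at least one contig, which matches the restriction stated for panels (C) and (D).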