Qichao Tu, Zhili He, Jizhong Zhou.
Abstract
Shotgun metagenome sequencing has become a fast, cheap and high-throughput technology for characterizing microbial communities in complex environments and human body sites. However, accurate identification of microorganisms at the strain/species level remains extremely challenging. We present a novel k-mer-based approach, termed GSMer, that identifies genome-specific markers (GSMs) from currently sequenced microbial genomes, which are then used for strain/species-level identification in metagenomes. Using 5390 sequenced microbial genomes, 8,770,321 strain-specific and 11,736,360 species-specific 50-mer GSMs were identified for 4088 strains and 2005 species (4933 strains), respectively. The GSMs were first evaluated against mock community metagenomes, recently sequenced genomes and real metagenomes from different body sites, suggesting that the identified GSMs were specific to their target genomes. Sensitivity evaluation against synthetic metagenomes with different coverage suggested that 50 GSMs per strain were sufficient to identify most microbial strains with ≥0.25× coverage, and that 10% of the selected GSMs in a database should be detected for a confident positive call. Application of GSMs identified 45 and 74 microbial strains/species significantly associated with type 2 diabetes patients and obese/lean individuals from corresponding gastrointestinal tract metagenomes, respectively. Our results agreed with previous studies while providing strain-level information. The approach can be directly applied to identify microbial strains/species from raw metagenomes, without the effort of complex data pre-processing.
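The positive-call rule in the abstract (10% of the selected GSMs detected) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `call_strains`, the input layout and the reading of 10% as a minimum fraction are assumptions made for illustration, and the strain names and hit counts are hypothetical.

```python
def call_strains(detected_gsms, gsms_in_db, min_fraction=0.10):
    """Return {strain: detected fraction} for strains passing the threshold.

    detected_gsms: strain -> set of GSM identifiers hit by at least one read
    gsms_in_db:    strain -> number of GSMs selected for that strain
    min_fraction:  assumed reading of the "10% of selected GSMs" rule
    """
    calls = {}
    for strain, total in gsms_in_db.items():
        hits = len(detected_gsms.get(strain, set()))
        fraction = hits / total
        if fraction >= min_fraction:
            calls[strain] = fraction
    return calls


# Hypothetical example: 50 GSMs per strain in the database.
gsms_in_db = {"strain_A": 50, "strain_B": 50}
detected = {
    "strain_A": {f"gsm_{i}" for i in range(6)},  # 6 of 50 GSMs detected
    "strain_B": {f"gsm_{i}" for i in range(3)},  # 3 of 50 GSMs detected
}
print(call_strains(detected, gsms_in_db))
```

In this toy input only `strain_A` is called, because 6 of 50 detected GSMs exceeds the threshold while 3 of 50 does not.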
Year: 2014 PMID: 24523352 PMCID: PMC4005670 DOI: 10.1093/nar/gku138
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Flowchart of the GSM identification process. First, k-mer databases (db) were constructed with the meryl program, comprising k-mers that occur in two or more microbial strains together with all human genome k-mers; k-mer sizes from 18 to 20 were selected. Second, 50-mer GSMs were generated for the selected strains/species and mapped against the k-mer db, and GSMs that mapped were filtered out. Third, all GSMs were searched against all microbial genomes by BLAST, and GSMs having 85% identity with non-target GSMs were also filtered out.
Figure 2. Location of the identified GSMs in the genome. (A) Strain-specific GSMs; (B) species-specific GSMs. Different colors denote different locations in the genome: blue for GSMs within genes, green for GSMs within intergenic regions, red for GSMs overlapping a gene and an intergenic region, and purple for unannotated genomes.
Figure 3. Specificity and sensitivity evaluation of identified GSMs. (A) Specificity evaluation against recently sequenced genomes. A total of 302 genomes were collected. (B) Specificity evaluation of GSMs targeting microorganisms isolated from different body sites using raw metagenome reads. GSMs targeting six different body sites (gastrointestinal tract, oral, airways, skin, blood and urogenital tract) were searched against metagenomes from nine different body sites (stool, subgingival plaque, tongue dorsum, throat, palatine tonsils, anterior nares, left retroauricular crease, right retroauricular crease and posterior fornix) using MEGABLAST. Numbers denote the percentages of MEGABLAST hits for GSMs targeting each body site. (C) Sensitivity evaluation of GSMs using simulated metagenomes from 695 gut microbial strains. Simulated metagenomes at seven different coverages (0.01, 0.03, 0.05, 0.1, 0.25, 0.5 and 0.75) were searched against different numbers of GSMs per strain (1, 5, 10, 25, 50, 100, 200 and 500). The percentages of identified microbial strains were analyzed.
The list of microbial strains significantly associated with T2D patients with mean normalized hits ≥5 in treatment/control
| Strain | Control (mean normalized hits ± SDOM) | Treatment (mean normalized hits ± SDOM) | | |
|---|---|---|---|---|
| T2D-enriched | | | | |
| | 4.79 ± 1.72 | 18.12 ± 4.58 | 0.0065 | 0.07 |
| | 3.60 ± 1.02 | 8.40 ± 1.84 | 0.0222 | 0.15 |
| | 3.43 ± 0.40 | 6.58 ± 1.14 | 0.0090 | 0.06 |
| | 29.88 ± 3.39 | 56.74 ± 8.38 | 0.0030 | 0.04 |
| | 9.27 ± 1.90 | 17.08 ± 3.31 | 0.0405 | 0.17 |
| | 2.87 ± 0.57 | 5.90 ± 1.41 | 0.0454 | 0.21 |
| | 4.20 ± 0.55 | 8.84 ± 2.00 | 0.0247 | 0.13 |
| | 15.18 ± 1.96 | 33.66 ± 4.71 | 0.0003 | 0.02 |
| | 4.03 ± 0.54 | 6.18 ± 0.78 | 0.0245 | 0.15 |
| | 3.28 ± 0.53 | 22.50 ± 9.04 | 0.0330 | 0.15 |
| | 1.27 ± 0.42 | 5.47 ± 1.85 | 0.0261 | 0.16 |
| | 17.93 ± 2.34 | 26.61 ± 3.74 | 0.0492 | 0.15 |
| | 4.51 ± 0.93 | 8.04 ± 1.05 | 0.0124 | 0.06 |
| | 2.04 ± 0.31 | 6.65 ± 1.49 | 0.0025 | 0.05 |
| Control-enriched | | | | |
| | 10.05 ± 0.85 | 7.35 ± 0.92 | 0.0318 | 0.19 |
| | 5.58 ± 1.15 | 2.91 ± 0.45 | 0.0319 | 0.14 |
| | 7.07 ± 1.05 | 3.50 ± 0.52 | 0.0026 | 0.05 |
| | 20.46 ± 2.30 | 12.75 ± 2.03 | 0.0124 | 0.14 |
| | 204.12 ± 33.5 | 106.57 ± 22.4 | 0.0164 | 0.11 |
| | 58.41 ± 14.61 | 14.00 ± 5.77 | 0.0052 | 0.04 |
| | 15.78 ± 2.14 | 7.30 ± 1.30 | 0.0008 | 0.04 |
| | 34.08 ± 4.61 | 21.76 ± 3.59 | 0.0360 | 0.18 |
Figure 4. Response ratio analysis of obese/lean-associated microorganisms at the phylum (A) and strain/species level (B). For strain/species-level analysis, only significantly associated ones with normalized hit number ≥5 were displayed. Asterisks refer to microbial strains that did NOT pass Benjamini–Hochberg FDR analysis at a corrected P-value cutoff of 0.05.
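For readers unfamiliar with the Benjamini–Hochberg correction named in the Figure 4 caption, the following Python sketch shows the standard step-up adjustment of a list of P-values. It is a generic textbook version, not necessarily the exact procedure or software the authors used, and the function name and input P-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw P-values for four strains.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
passed = [q <= 0.05 for q in adjusted]
print(adjusted, passed)
```

A strain would be flagged as passing when its adjusted value is at or below the chosen cutoff (0.05 in the caption).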