Valery V Panyukov, Sergey S Kiselev, Olga N Ozoline.
Abstract
The need for a comparative analysis of natural metagenomes stimulated the development of new methods for their taxonomic profiling. Alignment-free approaches based on the search for marker k-mers turned out to be capable of identifying not only species, but also strains of microorganisms with known genomes. Here, we evaluated the ability of genus-specific k-mers to distinguish eight phylogroups of Escherichia coli (A, B1, C, E, D, F, G, B2) and assessed the presence of their unique 22-mers in clinical samples from microbiomes of four healthy people and four patients with Crohn's disease. We found that a phylogenetic tree inferred from the pairwise distance matrix for unique 18-mers and 22-mers of 124 genomes was fully consistent with the topology of the tree, obtained with concatenated aligned sequences of orthologous genes. Therefore, we propose strain-specific "barcodes" for rapid phylotyping. Using unique 22-mers for taxonomic analysis, we detected microbes of all groups in human microbiomes; however, their presence in the five samples was significantly different. Pointing to the intraspecies heterogeneity of E. coli in the natural microflora, this also indicates the feasibility of further studies of the role of this heterogeneity in maintaining population homeostasis.
Keywords: alignment-free algorithms; bacterial genomes; genome barcodes; human microbiome; k-mers; metagenomes; phylogenetic trees; phylotyping; taxonomic profiling
Year: 2020 PMID: 32023871 PMCID: PMC7037511 DOI: 10.3390/ijms21030944
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. The size of the “unique genomes” represented by k-mers of different lengths for eight individual E. coli chromosomes, and the degree of their intersection exemplified by three indicated genomes. (a) Solid lines show the number of k-mers (N), normalized per 1 Mbp of each genome, that are found in the chromosomes of E. coli (strains K-12 MG1655, ETEC H10407, O26:H11 str. 11368, ABU 83972, APEC O78, str. 042, O157:H7 str. EC4115 and O7:K1 str. CE10) but are absent from the nucleotide sequences of the reference database. Dashed lines show the increment curves plotted for ΔN/Δk. (b) Venn diagram illustrating the intersection between the sets of 18-mers identified in the genomes of two bacteria from group A (E. coli K-12 MG1655 and ETEC H10407) and in E. coli O26:H11 str. 11368, which belongs to group B1. The number of unique 18-mers in each genome, the size of their common set and the intersection between the two sets of group A are indicated without normalization. The diagram was created using Venn Diagram Maker [54].
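The quantities plotted in panel (a) can be illustrated with a minimal Python sketch. It assumes plain string sequences held in memory and ignores reverse complements; the function names (`kmers`, `unique_kmers_per_mbp`, `increments`) are hypothetical and are not taken from the authors' software.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k found in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_kmers_per_mbp(genome, reference_seqs, k):
    """Count k-mers of `genome` absent from all reference sequences,
    normalized per 1 Mbp of the genome (the N of panel a)."""
    reference = set()
    for ref in reference_seqs:
        reference |= kmers(ref, k)
    unique = kmers(genome, k) - reference
    return len(unique) * 1_000_000 / len(genome)


def increments(n_by_k):
    """Return the increment (delta N / delta k) between consecutive k values,
    keyed by the larger k of each pair."""
    ks = sorted(n_by_k)
    return {k2: (n_by_k[k2] - n_by_k[k1]) / (k2 - k1)
            for k1, k2 in zip(ks, ks[1:])}


# Toy example: the sequences below are illustrative, not real genomes.
n = {k: unique_kmers_per_mbp("ACGTACGTTT", ["ACGTAC"], k) for k in (3, 4, 5)}
print(n, increments(n))
```

For real chromosomes a set of Python strings would be memory-hungry; the sketch is only meant to show what "unique" and "normalized per 1 Mbp" mean here.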
Figure 2. Phylogenetic tree for 124 E. coli strains inferred from concatenated aligned sequences of 27 genes in the IQ-TREE program [70] using the maximum likelihood method. The optimal nucleotide substitution model was GTR+G+I (the general time-reversible model assuming a fixed proportion of invariant sites and evolutionary rate differences described by the gamma distribution). Branch support, shown as a percentage, was estimated from 2000 iterations of the ultrafast bootstrap approximation [71]. The scale bar corresponds to the number of nucleotide substitutions per site. The color code corresponds to the eight indicated phylogroups. The names of all strains are indicated near the corresponding branches and are separated by commas for identical sequences in group B1.
Figure 3. Phylogenetic tree constructed by the neighbor-joining method in the MEGA X program [73]. The tree was inferred from the pairwise distance matrix for 124 sets of 18-mers unique to the genera Escherichia/Shigella and was identical to the tree constructed on the basis of 22-mers. The set of marker 18-mers from the genome of Escherichia albertii KF1 was used as the outgroup sample. The scale bar shows the Sorensen distance as a percentage. The same color code as in Figure 2 denotes the clades of the eight phylogroups.
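A pairwise distance matrix of this kind can be sketched as follows. The example assumes the standard Sørensen-Dice definition, d = 1 − 2|A∩B| / (|A| + |B|), expressed as a percentage; the caption does not spell out the exact formula, and the function names are hypothetical.

```python
def sorensen_distance(a, b):
    """Sorensen distance between two k-mer sets, as a percentage.

    Assumed definition: 100 * (1 - 2|A & B| / (|A| + |B|)).
    Two empty sets are treated as identical (distance 0).
    """
    if not a and not b:
        return 0.0
    return 100.0 * (1 - 2 * len(a & b) / (len(a) + len(b)))


def distance_matrix(sets_by_name):
    """Return (names, matrix) with the pairwise distances for all sets."""
    names = sorted(sets_by_name)
    matrix = [[sorensen_distance(sets_by_name[x], sets_by_name[y])
               for y in names] for x in names]
    return names, matrix


# Toy marker sets; the strain labels and k-mers are illustrative only.
toy = {
    "strain_1": {"AAA", "AAC", "ACG"},
    "strain_2": {"AAC", "ACG", "CGT", "GTT"},
    "outgroup": {"TTT", "TTG"},
}
names, matrix = distance_matrix(toy)
for name, row in zip(names, matrix):
    print(name, [round(d, 1) for d in row])
```

The resulting symmetric matrix is the kind of input that a neighbor-joining implementation takes.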
Table 1. Statistics for phylogroup-specific sets of 22-mers for 124 genomes of E. coli.
| Phylogroup | Number of Strains | Maximal Size of Marker 22-mer Set in an Individual Genome | Minimal Size of Marker 22-mer Set in an Individual Genome | Size of Core Set | Size of Cumulative Set |
|---|---|---|---|---|---|
| A | 17 | 143,024 | 24,726 | 232 | 1,055,426 |
| B1 | 25 | 161,117 | 72,365 | 143 | 1,600,260 |
| B2 | 23 | 515,073 | 379,072 | 29,343 | 2,539,510 |
| C | 14 | 148,829 | 56,030 | 8444 | 586,272 |
| D | 11 | 368,049 | 243,470 | 1298 | 1,778,210 |
| E | 13 | 463,307 | 292,542 | 10,213 | 1,582,445 |
| F | 11 | 355,845 | 248,277 | 20,640 | 1,159,521 |
| G | 10 | 235,711 | 146,632 | 51,125 | 599,863 |
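The "core" and "cumulative" columns can be read as set operations over the marker 22-mers of the strains in one phylogroup. The sketch below assumes that the core set is the intersection of all per-strain sets and the cumulative set is their union; the table reports only sizes, so this reading and the function name are assumptions.

```python
def core_and_cumulative(strain_sets):
    """Return (core, cumulative) for an iterable of per-strain k-mer sets.

    core       : k-mers present in every strain (intersection)
    cumulative : k-mers present in at least one strain (union)
    """
    sets = [set(s) for s in strain_sets]
    if not sets:
        return set(), set()
    core = set.intersection(*sets)
    cumulative = set.union(*sets)
    return core, cumulative


# Toy phylogroup with three strains; the "k-mers" are placeholders.
group = [{"a", "b", "c"}, {"b", "c", "d"}, {"c", "e"}]
core, cumulative = core_and_cumulative(group)
print(len(core), len(cumulative))
```

Under this reading the core set can only shrink and the cumulative set can only grow as strains are added to a group.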
Table 2. Number of E. coli group-specific 22-mers found in selected metagenomes.
| SRA ID of Metagenome (N) | A | B1 | B2 | C | D | E | F | G |
|---|---|---|---|---|---|---|---|---|
| Healthy individuals (1–4) | | | | | | | | |
| SRX187518 (1) | 17 | 15 | 26 | 5 | 12 | 14 | 8 | 0 |
| SRX187521 (2) | 9382 | 223 | 5060 | 33 | 559 | 110 | 182 | 39 |
| SRX187522 (3) | 31 | 191 | 11,566 | 17 | 34 | 273 | 83 | 15 |
| SRX187523 (4) | 29 | 29 | 62 | 10 | 21 | 22 | 17 | 6 |
| Patients with Crohn's disease (5–8) | | | | | | | | |
| SRX187524 (5) | 105 | 307 | 36,698 | 64 | 212 | 54 | 279 | 305 |
| SRX187525 (6) | 10 | 28 | 81 | 6 | 22 | 24 | 23 | 15 |
| SRX187526 (7) | 211 | 11,388 | 38,389 | 3435 | 774 | 223 | 1147 | 420 |
| SRX187527 (8) | 944 | 4418 | 462,763 | 1292 | 1019 | 1104 | 1713 | 2024 |
Figure 4. Phylogroup-dependent taxonomy of metagenomes from four healthy individuals (numbers 1–4) and four patients with Crohn’s disease (numbers 5–8). (a) Size distribution of the cumulative sets of unique 22-mers (colored symbols) and of the selected metagenomes (open symbols), numbered as in panel (b). (b) Number of sequence reads assigned to each phylogroup, normalized by the size of the cumulative set of 22-mers (Table 1) and by the number of reads in the metagenome. In both panels, numerical values are presented as natural logarithms.
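The normalization described for panel (b) might be computed as in the sketch below. It assumes that the assigned read count is divided by both the cumulative set size and the total number of reads before the natural logarithm is taken; the caption does not give the exact formula. The function name is hypothetical, and the read counts in the usage lines are made-up placeholders, since per-metagenome read totals are not listed here.

```python
import math


def normalized_log_abundance(assigned_reads, cumulative_set_size, total_reads):
    """Natural log of reads assigned to a phylogroup, normalized by the size
    of that group's cumulative 22-mer set and by the metagenome read count.

    Returns -inf when no reads were assigned, because log(0) is undefined.
    """
    if assigned_reads == 0:
        return -math.inf
    return math.log(assigned_reads / (cumulative_set_size * total_reads))


# Cumulative set size for phylogroup B2 as reported in the statistics table.
B2_CUMULATIVE = 2_539_510
# Hypothetical values, for illustration only.
assigned, total = 1_000, 10_000_000
print(normalized_log_abundance(assigned, B2_CUMULATIVE, total))
```

Dividing by the cumulative set size keeps groups with many marker 22-mers (such as B2) from appearing more abundant simply because they offer more targets.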