Gon Carmi, Alexander Bolshoy.
Abstract
The existence of multiple copies of genes is a well-known phenomenon. A gene family is a set of sufficiently similar genes formed by gene duplication. Earlier studies, conducted on a limited number of completely sequenced and annotated genomes, found that gene-family size and genome size are positively correlated, and that several atypical microbes deviate from this general trend. In this study, we re-examined these associations on a larger dataset of 1484 prokaryotic genomes using several ranking approaches. We applied the ranking methods so that genomes with fewer gene copies receive lower ranks. Until now only simple ranking methods had been used; we also applied the Kemeny optimal aggregation approach. Regression and correlation analyses were used to quantify and characterize the relationships between paralog indices and genome size, and boxplot analysis was employed for outlier detection. We found that, in general, all paralog indices correlate positively with genome size. As expected, different groups of atypical prokaryotic genomes were found for different paralog measures. Mycoplasmataceae and Halobacteria appear to be among the most interesting candidates for further research on evolution through gene duplication.
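Kemeny optimal aggregation seeks a consensus ordering that minimizes the total number of pairwise disagreements (the Kendall tau distance) with a set of input rankings. The following is a minimal brute-force sketch in Python, workable only for a handful of items. The function names and the toy rankings are illustrative assumptions and not the authors' implementation; a dataset of 1484 genomes would need a heuristic or an optimization solver instead of exhaustive search.

```python
from itertools import combinations, permutations

def kendall_distance(order_a, order_b):
    """Count item pairs that the two orderings place in opposite order."""
    pos_a = {item: i for i, item in enumerate(order_a)}
    pos_b = {item: i for i, item in enumerate(order_b)}
    return sum(
        1
        for x, y in combinations(order_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )

def kemeny_consensus(rankings):
    """Try every ordering of the items (O(n!)), keep the one with least total distance."""
    items = sorted(rankings[0])
    best = min(
        permutations(items),
        key=lambda cand: sum(kendall_distance(cand, r) for r in rankings),
    )
    return list(best)

# Hypothetical rankings of four genomes by three different measures.
rankings = [
    ["A", "B", "C", "D"],
    ["A", "C", "B", "D"],
    ["B", "A", "C", "D"],
]
consensus = kemeny_consensus(rankings)
print(consensus)
```

In this toy case each of the second and third rankings differs from the first by a single swap, so the first ranking is the consensus.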
Keywords: Halophiles; Mycobacterium leprae; Mycoplasmas; Orientia; combinatorial optimization; comparative genomics; genome size; number of paralogs
Year: 2016 PMID: 27527218 PMCID: PMC5041006 DOI: 10.3390/life6030030
Source DB: PubMed Journal: Life (Basel) ISSN: 2075-1729
Figure 1. (a) Fraction of paralogous families (paralog index, p.i.) plotted versus genome size. The input dataset consists of 1484 prokaryotic genomes. The Kendall rank correlation between p.i. and genome size is 0.72. The regression polynomial is $0.25 + 2.69x - 0.71x^{2} + 0.47x^{3} - 0.12x^{4}$. The regression is statistically significant (F statistic = 1790.059, p-value $< 2.2 \times 10^{-16}$). The green line shows the fitted model and the black lines delimit the confidence interval at the 0.95 level. Atypical genomes are determined by boxplot analysis on the residuals (see text for details) and are marked by red crosses; (b) The same as (a), showing only genomes of species from the Vibrio genus.
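The Kendall rank correlation quoted in the caption above compares every pair of genomes and asks whether the two variables order the pair the same way. A minimal pure-Python sketch of the tau-b variant (which corrects for ties) follows; the function name and the five data points are made-up illustrations, not values from the paper, and the source does not state which variant or software was used.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall tau-b: (concordant - discordant) pairs, normalized with a tie correction."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both denominator terms
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied_in_x = concordant + discordant + only_y_tied
    untied_in_y = concordant + discordant + only_x_tied
    denom = sqrt(untied_in_x * untied_in_y)
    return (concordant - discordant) / denom if denom else 0.0

# Hypothetical genome sizes (Mb) and paralog indices, for illustration only.
sizes = [1.5, 2.1, 2.3, 3.2, 4.0]
paralog_index = [0.07, 0.09, 0.12, 0.11, 0.20]
tau = kendall_tau_b(sizes, paralog_index)
print(round(tau, 2))
```

Of the ten pairs in this toy input, nine are concordant and one is discordant. The quadratic pair loop is fine for illustration; for thousands of genomes a library routine with an O(n log n) algorithm would be the practical choice.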
Figure 2. (a) Genomic average size of gene families versus genome size. The Kendall rank correlation between average family size and genome size is 0.77. The green line shows the fitted model and the black lines delimit the confidence interval at the 0.95 level. Atypical genomes are determined by boxplot analysis on the residuals (see text for details) and are marked by red crosses. The regression is statistically significant (F statistic = 176.698, p-value $< 2.2 \times 10^{-16}$). The regression polynomial is $1.66 + 13.92x + 0.82x^{2} + 0.3x^{3} - 0.47x^{4} - 0.02x^{5} + 0.87x^{6} + 0.41x^{7}$; (b) Genomes of species from the Mycobacterium genus (black rectangles; rectangles with crosses mark atypical genomes) and genomes of species from the Halobacteria class (red circles; circles with crosses mark atypical genomes).
Figure 3. (a) Genome ranking versus genome size for the same genomes. The prokaryotic genomes are ranked by applying a sorting procedure to the complete input matrix. The Kendall rank correlation between a genome's rank and its genome size is 0.78. The green line shows the fitted model and the black lines delimit the confidence interval at the 0.95 level. Atypical genomes are determined by boxplot analysis on the residuals (see text for details) and are marked by red crosses. The regression is statistically significant (F statistic = 1672.68, p-value $< 2.2 \times 10^{-16}$). The regression polynomial is $741.36 + 14769.57x - 3783.31x^{2} - 641.64x^{3} + 880.83x^{4} - 344.26x^{5} + 277.53x^{6}$; (b) Magnified view of the genomes of species from the Halobacteria class.
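The captions above flag atypical genomes by boxplot analysis on the regression residuals. A minimal sketch of that idea, assuming the common boxplot convention of 1.5 times the interquartile range beyond the quartiles (the surviving text does not state the whisker length, and the residual values below are invented for illustration):

```python
from statistics import quantiles

def boxplot_outliers(residuals, whisker=1.5):
    """Return indices of residuals lying outside the boxplot whiskers."""
    q1, _, q3 = quantiles(residuals, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [i for i, r in enumerate(residuals) if r < low or r > high]

# Hypothetical residuals (observed minus fitted value) for eight genomes.
residuals = [-0.02, 0.01, 0.00, 0.03, -0.01, 0.02, -0.03, 0.45]
flagged = boxplot_outliers(residuals)
print(flagged)
```

Only the last genome, whose observed value lies far above the fitted curve, falls outside the whiskers here.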
Figure 4. Relative frequency of larger gene families mp =
Atypical genomes according to a paralog index measure 1.
| Rank | p.i. | Size (Mb) | Atypical Genomes |
|---|---|---|---|
| 34.8 | 0.072 | 1.516 | |
| 38.3 | 0.094 | 2.127 | |
| 106 | 0.120 | 2.279 | |
| 379 | 0.158 | 3.168 | |
| 207 | 0.166 | 3.268 | |
| 208 | 0.166 | 3.268 | |
| 611 | 0.178 | 3.286 | |
| 769 | 0.198 | 3.939 | |
| 763 | 0.194 | 4.033 | |
| 385 | 0.182 | 4.171 | |
| 820 | 0.204 | 4.236 | |
| 1483 | 0.193 | 4.494 | |
| 787 | 0.211 | 4.532 | |
| 1072 | 0.225 | 5.008 | |
| 1281 | 0.231 | 5.166 | |
| 1293 | 0.234 | 5.969 |
1 p.i.: paralog index; Rank: averaged rank calculated over multiple runs of the S-ranking procedure. Genomes are sorted by ascending genome size for easier comparison with Figure 1.
Partial list of atypical genomes according to average number of paralogs 1.
| Rank | ave | Size (Mb) | Atypical Genomes |
|---|---|---|---|
| 246.8 | 1.521 | 0.853 | |
| … | |||
| 1225.1 | 1.915 | 2.809 | |
| 1233.4 | 1.936 | 2.821 | |
| 1235.3 | 2.008 | 2.848 | |
| 1091.1 | 1.878 | 2.914 | |
| 1240.8 | 2.067 | 3.420 | |
| 1306.5 | 2.071 | 3.668 | |
| 1260.9 | 2.036 | 3.752 | |
| 1419.5 | 2.378 | 3.889 | |
| … | |||
| 948.4 | 2.228 | 4.644 | |
| 1074.7 | 2.277 | 4.830 | |
| 1211.8 | 2.293 | 5.067 | |
| 1074.9 | 2.495 | 5.475 | |
| 1275.8 | 2.399 | 5.548 | |
| 1303.6 | 2.491 | 5.620 | |
| 1306.9 | 2.483 | 5.705 | |
| 1320.9 | 2.567 | 5.737 | |
| 1319.4 | 2.582 | 6.048 | |
| 1449.2 | 2.938 | 6.988 | |
| … | |||
| 1477.8 | 3.463 | 10.237 |
1 Rank: averaged rank calculated over multiple runs of the S-ranking procedure; ave: average number of paralogs.
Partial list of atypical genomes according to S-Rank.
| Rank | Size (Mb) | Atypical Genomes |
|---|---|---|
| 622.8 | 1.591 | |
| … | ||
| 803.4 | 2.001 | |
| 811.5 | 2.014 | |
| 1225.1 | 2.809 | |
| 1233.4 | 2.821 | |
| 1235.3 | 2.848 | |
| 1091.1 | 2.914 | Halophilic archaeon DL31 |
| 1186.8 | 3.261 | |
| 1240.8 | 3.420 | |
| 1235.0 | 3.484 | |
| 1306.5 | 3.668 | |
| 1260.9 | 3.752 | |
| 1419.5 | 3.889 | |
| … | ||
| 1057.6 | 7.750 |
List of atypical genomes according to mp 1.
| Rank | mp | Size (Mb) | Atypical Genomes |
|---|---|---|---|
| 31.2 | 0.32 | 0.580 | |
| 166.6 | 0.34 | 0.602 | |
| 21.9 | 0.00 | 0.706 | |
| 246.7 | 0.49 | 0.707 | Aster yellows witches broom phytoplasma AYWB |
| 11.5 | 0.04 | 0.792 | |
| 183.6 | 0.31 | 0.799 | |
| 31.8 | 0.39 | 0.816 | |
| 246.8 | 0.49 | 0.853 | Onion yellows phytoplasma OY M |
| 167.2 | 0.48 | 0.880 | |
| 192.8 | 0.37 | 0.948 | |
| 199.5 | 0.41 | 0.964 | |
| 191.8 | 0.34 | 0.978 | |
| 297.6 | 0.37 | 1.007 | |
| 186.2 | 0.45 | 1.119 | |
| 77.1 | 0.09 | 1.161 | |
| 420.9 | 0.45 | 1.317 | |
| 411.0 | 0.48 | 1.580 | |
| 358.7 | 0.44 | 1.667 | |
| 481.4 | 0.46 | 1.796 | |
| 196.2 | 0.20 | 1.887 | |
| 156.9 | 0.21 | 2.145 | |
| 158.0 | 0.22 | 2.153 | |
| 154.3 | 0.23 | 2.154 | |
| 160.0 | 0.22 | 2.184 | |
| 166.2 | 0.24 | 2.272 | |
| 105.6 | 0.24 | 2.279 | |
| 859.3 | 0.54 | 2.702 | |
| 1131.5 | 0.55 | 2.992 | |
| 1483.0 | 0.33 | 4.494 |
1
Mp =
Distribution of gene-family sizes of Mycoplasmataceae 1.
| NP | NO | NC | 1 | 2 | 3 | >3 | | Genome Name |
|---|---|---|---|---|---|---|---|---|
| 742 | 267 | 475 | 335 | 42 | 10 | 4 | 14/56 | |
| 813 | 291 | 522 | 332 | 42 | 15 | 10 | 25/67 | |
| 631 | 214 | 417 | 347 | 20 | 3 | 3 | 6/26 | |
| 801 | 279 | 522 | 346 | 37 | 11 | 11 | 22/59 | |
| 765 | 239 | 526 | 354 | 43 | 9 | 7 | 16/59 | |
| 812 | 236 | 576 | 390 | 58 | 10 | 7 | 17/65 | |
| 692 | 272 | 420 | 323 | 39 | 0 | 4 | 4/43 | |
| 689 | 199 | 490 | 380 | 37 | 6 | 4 | 10/47 | |
| 797 | 247 | 550 | 388 | 38 | 8 | 12 | 20/58 | |
| 1049 | 459 | 590 | 383 | 35 | 11 | 18 | 29/64 | |
| 763 | 274 | 489 | 357 | 43 | 4 | 6 | 10/53 | |
| 475 | 91 | 384 | 330 | 15 | 4 | 3 | 7/22 | |
| 1545 | 1258 | 287 | 230 | 16 | 2 | 1 | 3/19 | |
| 523 | 145 | 378 | 315 | 21 | 1 | 4 | 5/26 | |
| 691 | 254 | 437 | 331 | 39 | 1 | 3 | 4/43 | |
| 657 | 214 | 443 | 333 | 38 | 1 | 4 | 5/43 | |
| 657 | 186 | 471 | 344 | 44 | 2 | 4 | 6/50 | |
| 658 | 194 | 464 | 339 | 36 | 7 | 2 | 9/45 | |
| 882 | 316 | 566 | 398 | 50 | 9 | 8 | 17/67 | |
| 633 | 183 | 450 | 370 | 26 | 6 | 2 | 8/34 | |
| 922 | 303 | 619 | 400 | 55 | 6 | 14 | 20/75 | |
| 1017 | 325 | 692 | 397 | 55 | 15 | 16 | 31/86 | |
| 1037 | 379 | 658 | 447 | 54 | 10 | 14 | 30/84 | |
| 648 | 203 | 445 | 359 | 19 | 6 | 6 | 12/31 | |
| 782 | 222 | 560 | 387 | 36 | 8 | 17 | 25/61 | |
| 650 | 176 | 474 | 379 | 34 | 4 | 3 | 7/41 | |
| 845 | 592 | 253 | 209 | 14 | 0 | 2 | 2/16 | |
| 794 | 553 | 241 | 212 | 11 | 1 | 1 | 2/13 | |
| 659 | 180 | 479 | 357 | 33 | 10 | 5 | 15/48 | |
| 609 | 196 | 413 | 346 | 25 | 1 | 2 | 3/28 | |
| 614 | 173 | 441 | 360 | 29 | 3 | 2 | 5/34 | |
| 646 | 230 | 416 | 342 | 25 | 3 | 2 | 5/30 |
1 NP—number of proteins; NO—number of ORFans; NC—number of COG-annotated proteins; M.—Mycoplasma; U.—Ureaplasma.
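The rows of the table above can be cross-checked arithmetically. The sketch below assumes, as an inference from the numbers and not from any surviving definition, that the family-size cells (1, 2, 3, >3) count gene families of that size and that the ratio-valued cell in each row (for example 14/56) is the number of families with three or more members over the number with two or more. The function name is hypothetical.

```python
def check_row(n_proteins, n_orfans, n_cog, size1, size2, size3, size_gt3):
    """Return (NP == NO + NC?, families of size >= 3, families of size >= 2)."""
    adds_up = n_proteins == n_orfans + n_cog
    larger = size3 + size_gt3              # families with 3 or more members
    paralogous = size2 + size3 + size_gt3  # families with 2 or more members
    return adds_up, larger, paralogous

# First data row of the table: 742 | 267 | 475 | 335 | 42 | 10 | 4 | 14/56
adds_up, larger, paralogous = check_row(742, 267, 475, 335, 42, 10, 4)
ratio = larger / paralogous
print(adds_up, f"{larger}/{paralogous}", ratio)
```

For this row the protein counts add up and the computed ratio matches the tabulated 14/56. The size-1 count is accepted by the function but not needed for either check.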
Pairwise partial Kendall correlation between all ranking methods 1.
| | p.i. | ave | S-Rank | mp | Genome Size |
|---|---|---|---|---|---|
| p.i. | | 0.57 | 0.57 | 0.46 | 0.72 |
| ave | 0.57 | | 0.61 | 0.52 | 0.77 |
| S-Rank | 0.57 | 0.61 | | 0.38 | 0.78 |
| mp | 0.46 | 0.52 | 0.38 | | 0.66 |
1 All correlations were controlled for genome size and are statistically significant (p-value $< 2.2 \times 10^{-16}$).
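A partial Kendall correlation between two measures x and y, controlling for a third variable z (here, genome size), is commonly computed from the three pairwise tau values with the standard formula $\tau_{xy \cdot z} = (\tau_{xy} - \tau_{xz}\tau_{yz}) / \sqrt{(1 - \tau_{xz}^{2})(1 - \tau_{yz}^{2})}$. The sketch below applies that formula; the function name and input values are hypothetical, and the source does not state which software or estimator the authors used.

```python
from math import sqrt

def partial_kendall(tau_xy, tau_xz, tau_yz):
    """Partial Kendall tau of x and y given z, from the three pairwise taus."""
    return (tau_xy - tau_xz * tau_yz) / sqrt((1 - tau_xz**2) * (1 - tau_yz**2))

# Hypothetical pairwise correlations, for illustration only.
partial = partial_kendall(tau_xy=0.8, tau_xz=0.5, tau_yz=0.5)
print(round(partial, 3))
```

When z is uncorrelated with both measures, the partial correlation reduces to the plain tau between x and y.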
Partial list of atypical genomes according to average number of gene copies.
| Rank | Average number of gene copies | Size (Mb) | Atypical Genomes |
|---|---|---|---|
| 861 | 1.827 | 6.196 | |
| 1341 | 2.024 | 7.215 | |
| 1058 | 1.961 | 7.750 | |
| 1411 | 2.370 | 9.004 | |
| 1319 | 2.349 | 9.446 |
The indices of GFE of fictional data.
| Genome | ORFans | COG 1 | COG 2 | COG 3 | COG 4 | COG 5 | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | 10 | 1 | 1 | 1 | 1 | 16 | 0.2 | 4 | 1 | 1.0 |
| | 8 | 1 | 1 | 2 | 2 | 4 | 0.6 | 2 | 2 | 0.3 |
| | 20 | 1 | 1 | 1 | 6 | 6 | 0.4 | 3 | 3 | 0.7 |
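Two of the index values in each row of the table above can be reproduced from the five COG gene counts. The sketch assumes, as an inference from the arithmetic because the column headers did not survive, that the first index is the fraction of COG families with more than one member and the second is the mean number of genes per COG family. The remaining two index columns are not reconstructed here, and the function names are hypothetical.

```python
def paralog_fraction(family_sizes):
    """Fraction of families that contain more than one gene."""
    return sum(1 for s in family_sizes if s > 1) / len(family_sizes)

def average_family_size(family_sizes):
    """Mean number of genes per family."""
    return sum(family_sizes) / len(family_sizes)

# Gene counts in the five COGs for each of the three fictional genomes.
rows = [
    [1, 1, 1, 1, 16],
    [1, 1, 2, 2, 4],
    [1, 1, 1, 6, 6],
]
indices = [(paralog_fraction(r), average_family_size(r)) for r in rows]
for fraction, mean_size in indices:
    print(fraction, mean_size)
```

The computed pairs agree with the tabulated values 0.2 and 4, 0.6 and 2, and 0.4 and 3. ORFans do not enter either calculation under this reading.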