Evgeni Bolotin, Ruth Hershberg.
Abstract
Some of the most dangerous pathogens such as Mycobacterium tuberculosis and Yersinia pestis evolve clonally. This means that little or no recombination occurs between strains belonging to these species. Paradoxically, although different members of these species show extreme sequence similarity of orthologous genes, some show considerable intraspecies phenotypic variation, the source of which remains elusive. To examine the possible sources of phenotypic variation within clonal pathogenic bacterial species, we carried out an extensive genomic and pan-genomic analysis of the sources of genetic variation available to a large collection of clonal and nonclonal pathogenic bacterial species. We show that while nonclonal species diversify through a combination of changes to gene sequences, gene loss and gene gain, gene loss completely dominates as a source of genetic variation within clonal species. Indeed, gene loss is so prevalent within clonal species as to lead to levels of gene content variation comparable to those found in some nonclonal species that are much more diverged in their gene sequences and that acquire a substantial number of genes horizontally. Gene loss therefore needs to be taken into account as a potential dominant source of phenotypic variation within clonal bacterial species.
Keywords: bacterial evolution; clonal pathogens; gene loss; pangenome; sources of variation
Year: 2015; PMID: 26163675; PMCID: PMC4558853; DOI: 10.1093/gbe/evv135
Source DB: PubMed; Journal: Genome Biol Evol; ISSN: 1759-6653; Impact factor: 3.416
Figure: Genome size correlates very well with gene number across bacterial genomes. Each dot represents a single bacterial genome. Mycobacterium leprae is highlighted as a clear outlier to this trend. The genome of M. leprae is very large relative to its functional gene count because it maintains an uncharacteristically high number of pseudogenes.
Figure: Pangenome plots of ten recombining bacterial species. To generate these plots, all strains within a species were compared and orthologous genes were clustered together into groups. This allows for the calculation of the frequency with which each cluster of orthologous genes (referred to as a “pangene”) is found among members of its species. Depicted are the distributions of frequencies with which pangenes are found within each species. Two plots are provided for each species. Left: protein pangenome plot, generated based on annotated protein sequences. Right: HGT-artifact corrected pangenome plot, in which the protein pangenome was compared with the full DNA sequences of the strains from which it was generated. Pangenes appearing in less than 50% of strains within a species at the annotated protein level, but in more than 75% of the strains at the whole DNA level, were removed from the plot.
Figure: Pangenome plots of four clonal bacterial species. To generate these plots, all strains within a species were compared and orthologous genes were clustered together into groups. This allows for the calculation of the frequency with which each cluster of orthologous genes (referred to as a “pangene”) is found among members of its species. Depicted are the distributions of frequencies with which pangenes are found within each species. Two plots are provided for each species. Left: protein pangenome plot, generated based on annotated protein sequences. Right: HGT-artifact corrected pangenome plot, in which the protein pangenome was compared with the full DNA sequences of the strains from which it was generated. Pangenes appearing in less than 50% of strains within a species at the annotated protein level, but in more than 75% of the strains at the whole DNA level, were removed from the plot.
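The frequency calculation and the HGT-artifact correction described in these captions can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes that orthologous clustering has already been done and that presence of each pangene is available as a set of strain IDs, once at the annotated protein level and once at the whole DNA level. The function names and the toy data are hypothetical.

```python
def pangene_frequencies(presence, n_strains):
    """Return the fraction of strains in which each pangene is found.

    presence maps a pangene ID to the set of strain IDs carrying it.
    """
    return {gene: len(strains) / n_strains for gene, strains in presence.items()}


def correct_pangenome(protein_presence, dna_presence, n_strains):
    """Drop pangenes found in <50% of strains at the protein level
    but in >75% of strains at the whole DNA level."""
    protein_freq = pangene_frequencies(protein_presence, n_strains)
    dna_freq = pangene_frequencies(dna_presence, n_strains)
    kept = {}
    for gene, strains in protein_presence.items():
        is_artifact = protein_freq[gene] < 0.5 and dna_freq.get(gene, 0.0) > 0.75
        if not is_artifact:
            kept[gene] = strains
    return kept


# Hypothetical toy data: four strains, three pangenes.
protein_presence = {
    "g1": {"s1"},                    # annotated in one strain only
    "g2": {"s1"},                    # genuinely rare
    "g3": {"s1", "s2", "s3", "s4"},  # present everywhere
}
dna_presence = {
    "g1": {"s1", "s2", "s3", "s4"},  # sequence present in all strains
    "g2": {"s1"},
    "g3": {"s1", "s2", "s3", "s4"},
}
corrected = correct_pangenome(protein_presence, dna_presence, n_strains=4)
```

In this toy example `g1` is removed because it looks rare at the protein level while its sequence is present in every strain, whereas `g2` is kept because it is rare at both levels.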
Levels of Variation in the Sequences of Genes and in Gene Content within the 14 Examined Species and Correlations between These Types of Variation
| Species Name | No. of Analyzed Genomes | Minimal AAI | Maximal Fluidity | Spearman’s ρ (P-value) | % Increase in Genomic Fluidity per 1% Decrease in AAI | % Increase in Genomic Fluidity per 1% Decrease in AAI (Strains with AAI >99.5%) |
|---|---|---|---|---|---|---|
| | 13 | 93.05 | 0.225 | −0.843 ( | 1.51 | N/A |
| | 13 | 95.72 | 0.279 | −0.812 ( | 4.89 | 7.32 |
| | 10 | 95.60 | 0.191 | −0.638 ( | 2.65 | N/A |
| | 13 | 98.00 | 0.147 | −0.474 ( | 4.95 | N/A |
| | 15 | 98.59 | 0.103 | −0.746 ( | 4.03 | 15.46 |
| | 60 | 97.37 | 0.287 | −0.677 ( | 5.66 | 13.65 |
| | 12 | 98.84 | 0.168 | −0.748 ( | 6.33 | 8.95 |
| | 10 | 95.98 | 0.216 | −0.417 ( | 2.34 | N/A |
| | 26 | 96.06 | 0.133 | −0.825 ( | 1.71 | 10.14 |
| MTBC | 20 | 99.62 | 0.140 | −0.820 ( | 26.16 | 26.16 |
| | 14 | 97.52 | 0.162 | −0.785 ( | 5.67 | N/A |
| | 13 | 94.57 | 0.169 | −0.759 ( | 1.77 | N/A |
| | 21 | 97.67 | 0.187 | −0.567 ( | 5.56 | N/A |
| | 12 | 99.65 | 0.164 | −0.632 ( | 31.31 | 31.31 |
Note: N/A denotes species that have few or no pairs of strains with AAI >99.5%.
(a) Spearman’s ρ coefficient calculated for the correlation between AAI and genomic fluidity for the considered bacterial species. P-value is shown in parentheses.
(b) Clonal species.
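As an illustration of the correlation reported in the table, Spearman’s ρ between pairwise AAI and genomic fluidity can be computed by ranking both variables and taking the Pearson correlation of the ranks. The sketch below uses only the standard library; the function names and the five data points are hypothetical and are not taken from the study.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical pairwise values for five strain pairs.
aai = [99.9, 99.5, 98.7, 97.2, 96.0]
fluidity = [0.02, 0.05, 0.04, 0.10, 0.15]
rho = spearman_rho(aai, fluidity)
```

A negative ρ, as in every row of the table, means that strain pairs with lower AAI tend to have higher genomic fluidity. The sketch does not compute a P-value.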
Summary of Pangene Frequencies within the Corrected Pangenomes of the 14 Studied Species
| Species | Pangenome Size | “Rare” Pangenes (%) | “Near Core” Pangenes (%) | “Core” Pangenes (%) |
|---|---|---|---|---|
| | 9,769 | 46.67 | 5.39 | 34.34 |
| | 2,465 | 33.02 | 12.78 | 43.53 |
| | 5,417 | 39.47 | 9.71 | 43.53 |
| | 3,122 | 30.88 | 3.97 | 53.78 |
| | 2,209 | 5.57 | 16.79 | 70.03 |
| | 11,305 | 65.61 | 14.14 | 13.51 |
| | 1,484 | 3.23 | 12.74 | 69.00 |
| | 2,681 | 40.69 | 12.94 | 39.09 |
| | 4,197 | 32.14 | 6.70 | 53.80 |
| MTBC | 3,752 | 1.81 | 17.11 | 76.28 |
| | 2,463 | 30.00 | 7.96 | 52.46 |
| | 8,585 | 33.72 | 8.77 | 51.37 |
| | 2,960 | 37.13 | 9.53 | 42.23 |
| | 3,690 | 3.69 | 14.63 | 68.64 |
(a) The corrected pangenome is constructed by generating the pangenome based on annotated protein-coding genes and then removing pangenes if they are found in less than 50% of strains at the protein level but in more than 75% of strains at the whole DNA level.
(b) Number of pangenes (orthologous gene clusters) within the pangenome.
(c) % of pangenes that are found in less than 25% of strains of a species.
(d) % of pangenes that are found in over 75% of strains of a species, but are not found in all strains.
(e) % of pangenes found in all strains of a species.
(f) Clonal species.
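The three frequency categories defined in the footnotes can be expressed as a small classifier. This is a sketch under stated assumptions: the label "intermediate" for pangenes that fall between the 25% and 75% thresholds is hypothetical (the table does not name that group, which is why its three percentage columns do not sum to 100), and the strain counts in the example are invented.

```python
from collections import Counter


def classify_pangene(n_present, n_strains):
    """Assign a pangene to a frequency category from the table footnotes."""
    if n_present == n_strains:
        return "core"            # found in all strains
    freq = n_present / n_strains
    if freq > 0.75:
        return "near core"       # over 75% of strains, but not all
    if freq < 0.25:
        return "rare"            # less than 25% of strains
    return "intermediate"        # hypothetical label for everything else


def summarize(counts, n_strains):
    """Percentage of pangenes in each category."""
    tally = Counter(classify_pangene(c, n_strains) for c in counts)
    categories = ("rare", "intermediate", "near core", "core")
    return {cat: 100 * tally[cat] / len(counts) for cat in categories}


# Hypothetical example: number of strains (out of 20) carrying each of 8 pangenes.
strain_counts = [20, 20, 19, 16, 15, 10, 4, 1]
summary = summarize(strain_counts, n_strains=20)
```

Note that a pangene found in exactly 75% of strains (15 of 20 here) is not "near core" under the "over 75%" wording of footnote (d).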
Maximal Percentage of Lost “Near Core” Genes
| Species | Maximal Gene Loss (%), NCBI Annotation | Maximal Gene Loss (%), RAST Annotation |
|---|---|---|
| | 3.3 | 2.0 |
| | 10.2 | 5.8 |
| | 5.5 | 3.3 |
| | 1.7 | 1.6 |
| | 4.1 | 3.7 |
| | 6.1 | 4.9 |
| | 9.2 | 4.4 |
| | 3.8 | 3.3 |
| MTBC | 9.8 | 3.2 |
| | 2.7 | 3.6 |
| | 7.0 | 8.8 |
| | 5.1 | 2.7 |
| | 6.9 | 5.6 |
(a) For each strain, we calculated the percentage of genes lost from that genome as 100*L/T, where L is the number of “near core” pangenes that are absent from the genome and T is the total number of genes in that genome that were used to construct the species pangenome (see Materials and Methods). The table gives the maximal value obtained for each species.
(b) Pangenomes were generated using the annotation provided by the NCBI alongside each genome sequence.
(c) Pangenomes were generated using the annotations obtained by using RAST. Annotations were length corrected (see text).
(d) Clonal species.
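The per-strain loss percentage 100*L/T and its per-species maximum can be sketched as below. The data layout is an assumption made for illustration: each strain is given as the set of pangene IDs it carries plus its gene total T, and all names and numbers are hypothetical.

```python
def percent_gene_loss(present_pangenes, near_core_pangenes, total_genes):
    """Return 100 * L / T for one genome.

    L: number of "near core" pangenes absent from the genome.
    T: total number of the genome's genes used to build the pangenome.
    """
    lost = near_core_pangenes - present_pangenes
    return 100 * len(lost) / total_genes


def maximal_gene_loss(strains, near_core_pangenes):
    """Largest per-strain loss percentage within a species."""
    return max(
        percent_gene_loss(present, near_core_pangenes, total)
        for present, total in strains.values()
    )


# Hypothetical species with four "near core" pangenes and two strains.
near_core = {"a", "b", "c", "d"}
strains = {
    "strain1": ({"a", "b", "c", "d", "x"}, 50),  # lost none of the four
    "strain2": ({"a", "x", "y"}, 40),            # lost b, c and d
}
max_loss = maximal_gene_loss(strains, near_core)
```

Here `strain2` has L = 3 and T = 40, so its loss is 7.5%, which is also the species maximum.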
Frequencies of Pseudogene Maintenance
| Species | PCP (%) |
|---|---|
| | 40.17 |
| | 71.84 |
| | 51.79 |
| | 47.47 |
| | 96.74 |
| | 44.67 |
| | 95.00 |
| | 75.30 |
| | 39.07 |
| MTBC | 84.56 |
| | 60.99 |
| | 37.67 |
| | 71.49 |
| | 67.56 |
(a) The PCP metric was calculated by focusing on pangenes found, at the protein level, in 75% or more of strains within each species, but not in all strains of that species. PCP was calculated as the proportion of genomes from which such a pangene was absent at the protein level, but was still found at the DNA level.
(b) Clonal species.
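One way to read the PCP definition is as a pooled proportion over all (pangene, genome) pairs in which a qualifying pangene is missing at the protein level. The sketch below follows that reading, which is an assumption: the footnote does not say whether the proportion is pooled across pangenes or averaged per pangene. Function names and data are hypothetical.

```python
def pcp(protein_presence, dna_presence, strains):
    """Percentage of protein-level absences where the DNA is still found.

    Only pangenes present at the protein level in >=75% of strains,
    but not in all strains, are considered.
    """
    n_strains = len(strains)
    absent_pairs = 0
    retained_pairs = 0
    for gene, with_protein in protein_presence.items():
        freq = len(with_protein) / n_strains
        if freq < 0.75 or len(with_protein) == n_strains:
            continue  # pangene does not qualify
        for strain in strains - with_protein:
            absent_pairs += 1
            if strain in dna_presence.get(gene, set()):
                retained_pairs += 1
    if absent_pairs == 0:
        return float("nan")  # no qualifying absences to evaluate
    return 100 * retained_pairs / absent_pairs


# Hypothetical toy data: four strains, four pangenes.
all_strains = {"s1", "s2", "s3", "s4"}
protein_presence = {
    "g1": {"s1", "s2", "s3"},        # missing in s4, DNA retained there
    "g2": {"s1", "s2", "s4"},        # missing in s3, DNA gone as well
    "g3": {"s1", "s2", "s3", "s4"},  # in all strains, excluded
    "g4": {"s1"},                    # below 75%, excluded
}
dna_presence = {
    "g1": {"s1", "s2", "s3", "s4"},
    "g2": {"s1", "s2", "s4"},
    "g3": {"s1", "s2", "s3", "s4"},
    "g4": {"s1"},
}
pcp_value = pcp(protein_presence, dna_presence, all_strains)
```

In the toy data, one of the two protein-level absences still has its DNA present, giving a PCP of 50%.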