Matthew K O'Shea, Alan McNally, Danai Papakonstantinou, Steven J Dunn, Simon J Draper, Adam F Cunningham.
Abstract
Tuberculosis (TB) is responsible for millions of deaths annually. More effective vaccines and new antituberculous drugs are essential to control the disease. Numerous genomic studies have advanced our knowledge about M. tuberculosis drug resistance, population structure, and transmission patterns. At the same time, reverse vaccinology and drug discovery pipelines have identified potential immunogenic vaccine candidates or drug targets. However, a better understanding of the sequence variation of all the M. tuberculosis genes on a large scale could aid in the identification of new vaccine and drug targets. Achieving this was the focus of the current study. Genome sequence data were obtained from online public sources covering seven M. tuberculosis lineages. A total of 8,535 genome sequences were mapped against the M. tuberculosis H37Rv reference genome in order to identify single nucleotide polymorphisms (SNPs). The results of the initial mapping were further processed, and a frequency distribution of nucleotide variants within genes was identified and further analyzed. The majority of genomic positions in the M. tuberculosis H37Rv genome were conserved. Genes with the highest level of conservation were often associated with stress responses and maintenance of redox balance. Conversely, genes with high levels of nucleotide variation were often associated with drug resistance. We have provided a high-resolution analysis of the single-nucleotide variation of all M. tuberculosis genes across seven lineages as a resource to support future drug and vaccine development. We have identified a number of highly conserved genes, important in M. tuberculosis biology, that could potentially be used as targets for novel vaccine candidates and antituberculous medications.

IMPORTANCE Tuberculosis is an infectious disease caused by the bacterium Mycobacterium tuberculosis. In the first half of the 20th century, the discovery of the Mycobacterium bovis BCG vaccine and antituberculous drugs heralded a new era in the control of TB. However, combating TB has proven challenging, especially with the emergence of HIV and drug resistance. A major hindrance in TB control is the lack of an effective vaccine, as the efficacy of BCG is geographically variable and the vaccine provides little protection against pulmonary disease in high-risk groups. Our research is significant because it provides a resource to support future drug and vaccine development. We have achieved this by developing a better understanding of the nucleotide variation of all of the M. tuberculosis genes on a large scale and by identifying highly conserved genes that could potentially be used as targets for novel vaccine candidates and antituberculous medications.

Entities:
Keywords: Mycobacterium tuberculosis; SNPs; TB; drug targets; single nucleotide polymorphisms; single-nucleotide variation; tuberculosis; tuberculosis vaccines; vaccine candidates
Year: 2021 PMID: 33692198 PMCID: PMC8546714 DOI: 10.1128/mSphere.01224-20
Source DB: PubMed Journal: mSphere ISSN: 2379-5042 Impact factor: 4.389
FIG 1. Distribution of the sequence variation at a genomic position and CDS level. (A) Variants*, number of genomes with a variant across H37Rv. Mapping of 8,535 genomes against H37Rv demonstrated that 92.2% of H37Rv genomic positions were conserved upon comparison. A small number of positions had a variant in a large number of genomes (1 to 10 genomes had variants in 7.13% of genomic positions of H37Rv, 10 to 100 genomes had variants in 0.49% of H37Rv genomic positions, and 100 to 8,530 genomes had variants in 0.16% of H37Rv genomic positions). (B) Cumulative variants†, cumulative number of variants across the genomes at a CDS level. CDS from 1 to 3906 and their total number of variants across the data set. At a CDS level, all coding sequences have some degree of variation. Mobile elements, repeat regions, transposases, and RNAs were excluded from this analysis.
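The binning behind panel A can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the input dictionary, the function name `bin_positions`, and the toy numbers are all assumptions, and because the caption's bin edges overlap (1 to 10, 10 to 100, 100 to 8,530), the sketch arbitrarily treats them as 1-10, 11-100, and more than 100.

```python
from collections import Counter

def bin_positions(variant_counts, genome_length):
    """Return the percentage of reference positions falling in each bin.

    variant_counts maps a reference position to the number of genomes
    carrying a variant there; positions absent from the mapping are
    treated as conserved. Bin edges are an assumption (see lead-in).
    """
    bins = Counter()
    for n in variant_counts.values():
        if n == 0:
            continue  # listed, but no genome differs from the reference
        if n <= 10:
            bins["1-10"] += 1
        elif n <= 100:
            bins["11-100"] += 1
        else:
            bins[">100"] += 1
    bins["conserved"] = genome_length - sum(bins.values())
    return {label: 100.0 * count / genome_length for label, count in bins.items()}

# Toy data (made up): a 20-position "genome" with three variable positions.
toy_counts = {3: 1, 7: 12, 15: 5000, 18: 0}
shares = bin_positions(toy_counts, genome_length=20)
print(shares)
```

On the real data set the same tally would be run over every H37Rv position, with the counts taken from the mapped genomes.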
FIG 2. Variant hot spots within M. tuberculosis genes gidB (A), fadE33 (B), esxO (C), and gyrA (D). Four representative examples of genes with high variation: gidB, a gene associated with streptomycin resistance (6); fadE33 (a member of the fadE family), which plays a role in cholesterol metabolism (25); esxO, which belongs to the ESAT-6 group (16); and gyrA, associated with quinolone drug resistance (6). (A) gidB. The majority of genomic positions (POS) within gidB are conserved. However, 3,690 genomes had a variant at position 4407588, 1,811 genomes had a variant at position 4407927, and 1,734 genomes had a variant at position 4408156. (B) fadE33. The majority of genomic positions within fadE33 are conserved. However, 7,603 genomes had a variant at position 4005607, and 279 genomes had a variant at position 4005335. (C) esxO. The majority of genomic positions within esxO are conserved. Examples of highly variable genomic areas within esxO are positions 2625924 and 2626095. Specifically, 3,561 genomes had a variant at position 2625924 and 1,088 genomes had a variant at position 2626095. (D) gyrA. gyrA, which is associated with drug resistance to quinolones (6), is 2,517 base pairs (bp) long, and across the analyzed population, the majority of genomic positions within gyrA are conserved. This gene exhibits variation only at certain positions, some of which are known to be associated with quinolone resistance (e.g., 7,345 genomes had a variant at position 7585). In addition, 7,609 and 769 genomes had a variant at genomic positions 9304 and 9143, respectively.
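Hot spots of this kind can be pulled out of a per-position tally by ranking the positions inside a gene window. The sketch below is illustrative only: the function name `hot_spots` is hypothetical, the window coordinates are placeholders rather than annotated gene boundaries, and the two low-count entries are invented. Only the three gidB counts are taken from the caption.

```python
def hot_spots(variant_counts, start, end, top=3):
    """Rank positions within [start, end] by number of genomes with a variant."""
    in_window = {
        pos: n for pos, n in variant_counts.items()
        if start <= pos <= end and n > 0
    }
    return sorted(in_window.items(), key=lambda item: item[1], reverse=True)[:top]

counts = {
    4407588: 3690,  # gidB, from the caption
    4407927: 1811,  # gidB, from the caption
    4408156: 1734,  # gidB, from the caption
    4407700: 2,     # invented low-frequency position
    4409000: 9999,  # invented position outside the window
}

# Placeholder window chosen to enclose the three reported gidB positions.
top = hot_spots(counts, start=4407500, end=4408300)
print(top)
```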
FIG 3. Distribution of the single-nucleotide variation of the genes present in the 5th (A) and 95th (B) percentiles. (A) Fifth percentile genes. Of the genes with functional annotation to H37Rv, the gene with the lowest number of variants across the analyzed genomes in the 5th percentile was fdxC (0.05 genomes containing mutations/bp), which encodes a ferredoxin. It was noted that toxin-antitoxin genes of group II are predominant in the 5th percentile. This graph does not contain genes that were not genetically characterized with reference to H37Rv (i.e., hypothetical proteins). Information on these genes can be found in Data Set S1 in the supplemental material. Please note that in order to remove potential bias, the results were normalized by gene length (mutations/bp). (B) Ninety-fifth percentile genes. The esxO gene is the genetically characterized gene with the highest variation (24.3 genomes containing mutations/bp) across the analyzed genomes. Genes associated with drug resistance (e.g., lppB, lppA, gidB), fadD and fadE families (e.g., fadE33, fadE32, fadH), and ESAT-6 genes are present in the 95th percentile. This graph does not contain genes that were not genetically characterized with reference to H37Rv (i.e., hypothetical proteins). Information on these genes can be found in Data Set S1 in the supplemental material. Please note that in order to remove potential bias, the results were normalized by gene length (mutations/bp).
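The length normalization and percentile selection described in this caption can be illustrated as follows. This is a sketch under stated assumptions: the gene names and counts are toy values, the function names are hypothetical, and the nearest-rank percentile rule is a choice made here for simplicity; the source does not say which percentile method was used.

```python
import math

def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (pct between 0 and 100)."""
    k = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[k - 1]

def percentile_genes(genes, low_pct=5, high_pct=95):
    """Split genes into low- and high-variation tails after length normalization.

    genes maps a gene name to (cumulative variant count, gene length in bp).
    """
    # Normalize by gene length so long genes are not favored (mutations/bp).
    ratios = {name: total / length for name, (total, length) in genes.items()}
    ordered = sorted(ratios.values())
    low_cut = nearest_rank(ordered, low_pct)
    high_cut = nearest_rank(ordered, high_pct)
    low = sorted(g for g, r in ratios.items() if r <= low_cut)
    high = sorted(g for g, r in ratios.items() if r >= high_cut)
    return low, high

# Toy data (made up): 20 genes of 100 bp with 10, 20, ..., 200 variants.
toy_genes = {f"gene{i:02d}": (10 * i, 100) for i in range(1, 21)}
low, high = percentile_genes(toy_genes)
print(low, high)
```

Dividing by length before ranking is what keeps a long gene with many scattered variants from outranking a short gene with a dense hot spot.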
Protein prediction in the 95th percentile
| Category | COG(s) | Gene(s) | Protein information |
|---|---|---|---|
| Cellular processes and signaling [D],[M],[N],[O],[T],[U],[V],[W],[Y],[Z] | D | | Antitoxin, TA group |
| | M | | Associated with streptomycin resistance |
| | M | | Peptidoglycan biosynthesis |
| | O | | Intermediary metabolism and respiration |
| | T | | Regulatory proteins |
| | T | | Toxin, TA group |
| | U | | Probable transmembrane transporter |
| Information storage and processing [A],[B],[J],[K],[L] | J | | Information pathways |
| | J | | CMP-type deaminase domain protein |
| | K | | Regulatory proteins |
| | K | | Cold shock protein |
| | L | | Associated with quinolone resistance |
| | L | | DNA recombination and repair |
| Metabolism [C],[E],[F],[G],[H],[I],[P],[Q] | C, H | | Lipid metabolism |
| | C, G, Q | | Intermediary metabolism and respiration |
| | C, E | | Intermediary metabolism and respiration |
| | E | | Transmembrane transporter activity |
| | E | | Amino acid biosynthesis |
| | F | | Purine biosynthesis/purine salvage |
| | F, P | | Cell wall and cell processes |
| | F, I, G, H | | Intermediate metabolism and respiration |
| | H | | Riboflavin/cobalamin biosynthesis |
| | H, I | | Involved in lipid metabolism |
| | H, I | | Involved in lipid metabolism |
| | I, H | | Intermediate metabolism and respiration |
| | P | | Transmembrane transporter activity |
| | P | | Iron storage protein |
| | Q | | Virulence factor, Mce family |
| | Q | | Part of mce2 operon |
| Poorly characterized [R],[S] | S | | Immunogenic, cell wall and cell processes |
| | S/UC | | Cholesterol catabolism |
| | S/UC | | TA group |
| | S/UC | | Possible lipoproteins |
| Unable to characterize (UC) | S/UC | | ESAT-6-like protein |
| | S/UC | | ESX-1 secretion system |
| | S/UC | | Determinant of intrinsic |
Classification of clusters of orthologous protein groups (COGs) in the 95th percentile, combined with information from Mycobrowser (18) and UniProt (19). Additional information from the literature is individually cited within the table. Genes related to basic COG categories (e.g., metabolism) were observed in both percentiles. However, certain families, such as the fadD and fadE genes (e.g., fadE33, fadD5, fadE32), associated with fatty acid and cholesterol metabolism, were observed only in the 95th percentile (17). Genes related to pathogenesis of TB disease (e.g., ESAT-6/ESX genes) and antibiotic resistance (e.g., gyrA, gidB) are present. ESAT-6/ESX family genes were predicted as poorly characterized. The classification involved proteins encoded by genetically characterized genes with reference to H37Rv. Protein prediction for the noncharacterized genes can be found in Data Set S1 in the supplemental material.
COG subcategories are explained analytically in the legend to Table S2 in the supplemental material.
A number of genes have been identified as high-confidence drug targets (22).
FIG 4. Drug and vaccine candidates previously proposed in the literature within the 5th and 95th percentiles. It is striking that a large number of genes demonstrating high single-nucleotide variation (95th percentile) in our data set have been previously proposed in the literature as desirable drug candidates (gray bars) (22). A few of these drug targets have been previously selected due to their location (e.g., tatB) or their biological function (e.g., fadE33) (17, 22). In addition, three genes in the 95th percentile have been previously proposed as potential vaccine candidates (esxW, mpt53, apa) (black bars) (5, 21). In fact, esxW encodes an immunogenic protein, which is present in a current subunit vaccine (ID93/GLA-SE) (5). A smaller number of genes that are highly conserved in our data set (5th percentile) have been previously proposed as drug targets (gray bars). Genes in the 5th percentile, previously proposed as drug targets, have a median ratio of 0.109272 mutations/bp. Genes in the 95th percentile, previously proposed as drug targets, have a median ratio of 6.691533 mutations/bp. Biological functions and functional annotation of the genes are described in Tables 1 and 2.
Protein prediction in the 5th percentile
| Category | COG(s) | Gene(s) | Protein information |
|---|---|---|---|
| Cellular processes and signaling [D],[M],[N],[O],[T],[U],[V],[W],[Y],[Z] | D | | Toxin-antitoxin (TA) group |
| | M | | Carbohydrate biosynthesis |
| | M | | Essential cell division protein |
| | M | | Cell surface lipoprotein |
| | O | | Chaperonin GroES |
| | O | | Peroxiredoxin (direct antioxidants) |
| | T | | Response regulator |
| | T | | Virulence and glutamate metabolism |
| | V, M | | Role in lipid metabolism |
| Information storage and processing [A],[B],[J],[K],[L] | J, K, L | | Information pathways |
| | K | | Antitoxin (TA group) |
| | K | | Transcriptional regulator |
| | K | | Cold shock protein |
| | K | | Amino acid biosynthesis |
| | K | | Regulatory proteins |
| Metabolism [C],[E],[F],[G],[H],[I],[P],[Q] | C | | Iron-sulfur proteins |
| | C | | Probable cytochrome oxidase |
| | C | | Probable rubredoxin |
| | C, E | | Intermediate metabolism and respiration |
| | E | | Amino acid transport and metabolism |
| | F | | Ribonucleotide reductase function |
| | F, G, H, I | | Intermediate metabolism and respiration |
| | G, Q, I, P | | Role in lipid metabolism |
| | G, P | | ABC transporter |
| | H | | Involved in folate metabolism |
| | H | | Coenzyme A (CoA) biosynthesis |
| | H | | Molybdopterin biosynthesis |
| | I | | Mycobactin biosynthesis |
| | P | | Sulfate activation pathway |
| | P | | Transmembrane protein |
| Poorly characterized [R],[S] | UC | | Antitoxin (TA group) |
| | S | | Toxin (TA group) |
| | S | | TA group |
| | S | | Multicopper oxidase |
| Unable to characterize (UC) | UC, S | | Cell and cell wall-associated processes |
| | UC | | Nucleoid-associated protein Lsr2 |
| | UC | | Antitoxin (TA group) |
| | UC | | Metallothionein |
| | UC | | Acid and phagosome regulated protein |
| | UC | | Information pathways |
COG classification of proteins combined with information from Mycobrowser (18) and UniProt (19). Additional information from the literature, which cannot be found in these two databases, is individually cited within the table. Genes related to the metabolism of essential elements for M. tuberculosis survival, such as thiamine (e.g., thiC), and others related to cell envelope and active transport (e.g., sugC) were also observed. Genes belonging to the TA family, as well as genes related to metal binding and antioxidant activity, are present in the 5th percentile. The majority of the TA genes are poorly characterized by COGs. The classification involved proteins encoded by genetically characterized genes with reference to H37Rv. Protein prediction for the noncharacterized genes can be found in Data Set S1 in the supplemental material.