C N'Dira Sanoussi, Mireia Coscolla, Boatema Ofori-Anyinam, Isaac Darko Otchere, Martin Antonio, Stefan Niemann, Julian Parkhill, Simon Harris, Dorothy Yeboah-Manu, Sebastien Gagneux, Leen Rigouts, Dissou Affolabi, Bouke C de Jong, Conor J Meehan.
Abstract
Pathogens of the Mycobacterium tuberculosis complex (MTBC) are considered to be monomorphic, with little gene content variation between strains. Nevertheless, several genotypic and phenotypic factors separate strains of the different MTBC lineages (L), especially L5 and L6 (traditionally termed Mycobacterium africanum) strains, from each other. However, this genome variability and gene content, especially of L5 strains, has not been fully explored and may be important for pathobiology and current approaches for genomic analysis of MTBC strains, including transmission studies. By comparing the genomes of 355 L5 clinical strains (including 3 complete genomes and 352 Illumina whole-genome sequenced isolates) to each other and to H37Rv, we identified multiple genes that were differentially present or absent between H37Rv and L5 strains. Additionally, considerable gene content variability was found across L5 strains, including a split in the L5.3 sub-lineage into L5.3.1 and L5.3.2. These gene content differences had a small knock-on effect on transmission cluster estimation, with clustering rates influenced by the selected reference genome, and with potential overestimation of recent transmission when using H37Rv as the reference genome. We conclude that full capture of the gene diversity, especially high-resolution outbreak analysis, requires a variation of the single H37Rv-centric reference genome mapping approach currently used in most whole-genome sequencing data analysis pipelines. Moreover, the high within-lineage gene content variability suggests that the pan-genome of M. tuberculosis is at least several kilobases larger than previously thought, implying that a concatenated or reference-free genome assembly (de novo) approach may be needed for particular questions.
Keywords: H37Rv; L5.3.2; M. africanum; gene presence/absence; genomic diversity; lineage 5; reference genome; within-lineage variability
Year: 2021 PMID: 34241588 PMCID: PMC8477398 DOI: 10.1099/mgen.0.000437
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
Gene content difference between the H37Rv (L4) genome and the complete (PacBio-sequenced) genomes of three L5 strains from Benin, The Gambia and Nigeria
| Present in | | | | | |
|---|---|---|---|---|---|
| | | | | | |
| | | | | | |
| | 2+3* | 2+ | | | |
| | 34+3* | 34 | 32+ | | |
| | | | | | |
| | | 4189 | 4162 | 4134 | 4126 |
*, includes three genes only present in PcbL5Ben.
†, includes nine genes only present in H37Rv.
‡, includes 10 genes shared by the PcbL5 (Benin, The Gambia, Nigeria) and absent in H37Rv.
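The kind of comparison summarised in this table (genes shared by all genomes, genes found in a single genome, and genes shared by the L5 genomes but absent from H37Rv) can be expressed as set operations on gene identifiers. The following is a minimal sketch of that idea only, not the authors' pipeline; the function name `compare_gene_content` and the gene identifiers are hypothetical, and the toy sets are made up for illustration.

```python
def compare_gene_content(genomes):
    """Return (core genes, genes unique to each genome) for a dict of gene sets."""
    # Core genes: present in every genome.
    core = set.intersection(*genomes.values())
    unique = {}
    for name, genes in genomes.items():
        # Union of the gene sets of all other genomes.
        others = set().union(*(g for other, g in genomes.items() if other != name))
        unique[name] = genes - others
    return core, unique


# Toy gene sets with made-up identifiers (NOT data from the study).
toy_genomes = {
    "H37Rv": {"geneA", "geneB", "geneC"},
    "PcbL5Ben": {"geneA", "geneB", "geneD", "geneE"},
    "PcbL5Gam": {"geneA", "geneB", "geneD"},
    "PcbL5Nig": {"geneA", "geneD"},
}

core, unique = compare_gene_content(toy_genomes)

# Genes shared by all three L5 genomes but absent from H37Rv.
l5_names = [name for name in toy_genomes if name != "H37Rv"]
l5_only = set.intersection(*(toy_genomes[n] for n in l5_names)) - toy_genomes["H37Rv"]

print("core:", sorted(core))
print("unique:", {name: sorted(genes) for name, genes in unique.items()})
print("L5 only:", sorted(l5_only))
```

In practice the gene sets would come from an orthologue-clustering step, since the same gene carries different locus tags in different annotations; that step is outside the scope of this sketch.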
Genes in the L5.3.2-Del region (PcbL5Nig-Del) and their function category. None of the 30 genes is an essential gene (Mycobrowser). The 30 genes formed 2 regions: Rv1493 through Rv1509 (L5.3.2-Del region 1) and Rv1511 through Rv1521 (L5.3.2-Del region 2), separated by the gene Rv1510, which is present in the L5.3.2 isolate. All genes in the table are absent from all L5.3.2 genomes (both PacBio- and Illumina-sequenced), except those marked with an asterisk (*), which are present in the PacBio-sequenced genome (PcbL5Nig) but absent in all six Illumina-sequenced genomes, and the one marked with a hash (#) (Rv1492), which is a gene present in all L5.3.2 strains and flanking the L5.3.2-specific deletion
| Gene name | Size | Co-ordinates in H37Rv | Functional category | Present in L5 Illumina-sequenced genomes, % (n) |
|---|---|---|---|---|
| Rv1492# | 1848 bp | 1 682 157–1 684 004 | Lipid metabolism | 100 (202) |
| | 2253 bp | 1 684 005–1 686 257 | Lipid metabolism | 97 (196) |
| | 303 bp | 1 686 271–1 686 573 | Virulence, detoxification, adaptation | 97 (196) |
| | 318 bp | 1 686 570–1 686 887 | Virulence, detoxification, adaptation | 97 (196) |
| | 1005 bp | 1 686 884–1 687 888 | Cell wall and cell processes | 97 (196) |
| | 1290 bp | 1 687 941–1 689 230 | Intermediary metabolism and respiration | 97 (196) |
| | 618 bp | 1 689 303–1 689 920 | Intermediary metabolism and respiration | 97 (196) |
| | 213 bp | 1 690 134–1 690 346 | Conserved hypothetical protein | 97 (196) |
| | 399 bp | 1 690 407–1 690 805 | Conserved hypothetical protein | 97 (196) |
| | 1029 bp | 1 690 850–1 691 878 | Intermediary metabolism and respiration | 97 (196) |
| | 822 bp | 1 691 890–1 692 711 | Conserved hypothetical protein | 97 (196) |
| | 900 bp | 1 692 924–1 693 823 | Unknown | 97 (196) |
| | 549 bp | 1 693 996–1 694 544 | Conserved hypothetical protein | 97 (196) |
| | 600 bp | 1 694 545–1 695 144 | Conserved hypothetical protein | 97 (196) |
| | 666 bp | 1 695 281–1 695 946 | Conserved hypotheticals | 97 (196) |
| | 501 bp | 1 695 943–1 696 443 | Unknown | 97 (196) |
| | 504 bp | 1 697 356–1 697 859 | Unknown | 97 (196) |
| | 696 bp | 1 696 727–1 697 422 | Conserved hypotheticals | 97 (196) |
| | 1800 bp | 1 698 095–1 699 894 | Cell wall and cell processes | 97 (196) |
| | 636 bp | 1 699 866–1 700 228 | Conserved hypotheticals | 97 (196) |
| | 882 bp | 1 700 212–1 701 093 | Unknown | 97 (196) |
| | 1299 bp | 1 701 295–1 702 593 | Cell wall and cell processes | 97 (196) |
| | 1023 bp | 1 703 074–1 704 096 | Intermediary metabolism and respiration | 97 (196) |
| | 969 bp | 1 704 093–1 705 061 | Intermediary metabolism and respiration | 97 (196) |
| | 732 bp | 1 705 058–1 705 789 | Conserved hypothetical protein | 97 (196) |
| | 789 bp | 1 705 807–1 706 595 | Conserved hypothetical protein | 97 (196) |
| | 897 bp | 1 706 630–1 707 526 | Conserved hypothetical protein | 97 (196) |
| | 1011 bp | 1 707 529–1 708 539 | Intermediary metabolism and respiration | 97 (196) |
| | 765 bp | 1 708 871–1 709 635 | Cell wall and cell processes | 97 (196) |
| | 960 bp | 1 709 644–1 710 603 | Conserved hypothetical protein | 97 (196) |
| | 270 bp | 1 710 733–1 711 002 | Conserved hypothetical protein | 97 (196) |
| | 1041 bp | 1 711 028–1 712 068 | Intermediary metabolism and respiration | 97 (196) |
| | 1752 bp | 1 712 302–1 714 053 | Lipid metabolism | 97 (196) |
| | 3441 bp | 1 714 172–1 717 612 | Cell wall and cell processes | 97 (196) |
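The Size and Co-ordinates columns above are related by simple arithmetic: with inclusive start and end positions, a gene's length is end minus start plus one. The sketch below illustrates that relationship under the assumption that the co-ordinates are 1-based and inclusive; the helper name `span_length` is hypothetical, and the parser simply ignores the spaces used as thousands separators in the table.

```python
def span_length(coords):
    """Length in bp of an inclusive 'start–end' co-ordinate string.

    Assumes the two positions are separated by an en dash (U+2013) and
    that any spaces inside a number are thousands separators.
    """
    start_text, end_text = coords.split("\u2013")
    # Keep digits only, so "1 682 157" becomes 1682157.
    start = int("".join(ch for ch in start_text if ch.isdigit()))
    end = int("".join(ch for ch in end_text if ch.isdigit()))
    return end - start + 1


# First two rows of the table: listed sizes are 1848 bp and 2253 bp.
print(span_length("1 682 157\u20131 684 004"))  # 1848
print(span_length("1 684 005\u20131 686 257"))  # 2253
```

A check of this kind is a quick way to spot transcription errors in co-ordinate columns, because a misplaced digit changes the computed length.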
Mapping of Illumina-sequenced genomes of the 202 L5 strains to the H37Rv (L4) genome and complete genomes of 3 L5 strains from Benin, The Gambia and Nigeria (mapping statistics/estimates). The best mapping results (numbers) are written in bold. When the best mapping result has been obtained for PcbL5Nig as the reference, the next best result is also written in bold (as PcbL5Nig compared to the other 2 PcbL5 genomes missed a 30-gene region)
| | H37Rv | PcbL5Ben | PcbL5Gam | PcbL5Nig |
|---|---|---|---|---|
| Reads | | | | |
| | 96.9 | | 97.3 | 96.5 |
| | 122.3 | 123.2 | | 123.0 |
| Bases | | | | |
| | 98.0 | 98.8 | | 98.2 |
| | 30 599.0 | 18 916.7 | 14 766.4 | 35 340.4 |
| | 2209.7 | 529.5 | | |
| | 374.1 | 96.5 | | 97.9 |
| | 239.5 | | 107.6 | |
| | 1193 | | | |
| Genes | | | | |
| | 99.59 | 99.71 | | 99.76 |
| | 0 | 2 (4) | 7.4 (15) | 0 |
| | 30.3 | | 34.4 | 32.8 |
Presence in Illumina-sequenced genomes from 202 L5 strains of genes detected in only 1 of the complete genomes of the 3 L5 strains from Benin, The Gambia and Nigeria
| | Gene | Co-ordinates in the specified genome | Functional group | Present in L5 Illumina-sequenced genomes, % (n) |
|---|---|---|---|---|
| | PcbL5_01893 | 2004773–2005918 | Cell wall and cell processes | |
| | PcbL5_01894 | 2006144–2007247 | Intermediary metabolism and respiration | |
| | PcbL5_01895 | 2007448–2010285 | Cell wall and cell processes | |
Fig. 1. Phylogenetic tree showing the Illumina-sequenced genomes of the six L5 strains (L5.3.2) similar to the complete (PacBio-sequenced) genome of the Nigerian L5 strain (PcbL5Nig) and the position of the other two PacBio-sequenced L5 genomes (PcbL5Ben and PcbL5Gam). NigDel = L5Nig-Del = L5.3.2-Del = region of 30 genes (2 blocks of 19 and 11 genes: Rv1493 through Rv1509 and Rv1511 through Rv1521) missing in L5.3.2 strains but present in all other L5 strains (L5.1, L5.2, L5.3.1 and new sub-lineages).
Genes present or absent in all L5 genomes compared to H37Rv. The genes are ordered in the table along with their flanking genes, as in the genome. The RD regions are those reported by Gordon et al. [13]
| Gene name | Present in | Size | Co-ordinates in H37Rv | Co-ordinates in PcbL5Ben | Functional category | Present in L5 Illumina-sequenced genomes, % (n) | Belongs to RD# (no. total of genes forming the RD) |
|---|---|---|---|---|---|---|---|
| PcbL5Ben_2128 | | | 2218844–2219251 | 2244287–2244565 | Conserved hypothetical protein | 100 (202) | RD7 |
| | | 648 bp | – | 2244695–2245342 | | | |
| | | 600 bp | – | 2246122–2246721 | | | |
| | | 1047 bp | 2219754–2220800 | – | | | |
| Rv1978 | H37Rv, L6, | 849 bp | 2220908–2221756 | – | Conserved hypothetical protein | 1 (2) | RD2 |
| | | 1446 bp | 2221719–2223164 | – | | | |
| PcbL5Ben_2131 | | | 2223343–2224029 | 2246867–2247553 | Cell wall and cell processes | 100 (202) | RD2 |
| ………//……… | | | | | | | |
| PcbL5Ben_2149 (Rv1992) | | | 2234991–2237306 | 2258516–2259319 | Cell wall and cell processes | 100 (202) | |
| | | 273 bp | 2237303–2237575 | – | | | |
| Rv1994c (ctmR) | H37Rv, L6, | 357 bp | 2237628–2237984 | – | Regulatory proteins | 0.5 (1) | |
| | | 729 bp | 2238141–2238908 | – | | | |
| PcbL5Ben_2150 (Rv1996) | | | 2239004–2239957 | 2259333–2260169 | Virulence, detoxification, adaptation | 100 (202) | |
| ………//……… | | | | | | | |
| PcbL5Ben_2180 (Rv2023A) | | | 2268268–2268726 | 2289843–2290475 | Conserved hypothetical protein | 100 (202) | |
| | | 237 bp | – | 2290472–2290708 | Hypothetical protein | | |
| | | 897 bp | – | 2290915–2291811 | Unknown function | | |
| PcbL5Ben_2183 (Rv2024c) | | | 2268693–2270240 | 2291995–2296815 | Conserved hypothetical protein | 100 (202) | |
| ………//……… | | | | | | | |
| PcbL5Ben_2231, PcbL5Ben_2232 (Rv2072 (cobL)) | | | 2328974–2330146 | 2355547–2356422, 2356373–2356561 | Intermediary metabolism and respiration | 100 (202) | RD9 (4 genes, including Rv2072 truncated) |
| | | 750 bp | 2330214–2330963 | – | | | |
| | | 408 bp | 2330993–2331406 | | | | |
| PcbL5Ben_2233 (Rv2075c) | | | 2331416–2332879 | 2356636–2357388 | Cell wall and cell processes | 100 (202) | RD9 (4 genes, including Rv2075c truncated) |
Comparison of transmission clustering rates based on choice of reference genome. Short-read data from 355 L5 strains were mapped against each of the 4 reference genomes for SNP calling. Distance matrices between all strains were constructed per the reference approach and transmission clusters were defined based on specific SNP cut-offs
| 1 SNP | | | |
|---|---|---|---|
| H37Rv | 100 | 28.17 | 44 |
| PcbL5Ben | 94 | 26.48 | 43 |
| PcbL5Gam | 95 | 26.76 | 43 |
| PcbL5Nig | 95 | 26.76 | 43 |
| | | | |
| H37Rv | 129 | 36.34 | 53 |
| PcbL5Ben | 124 | 34.93 | 51 |
| PcbL5Gam | 124 | 34.93 | 51 |
| PcbL5Nig | 124 | 34.93 | 51 |
| | | | |
| H37Rv | 144 | 40.56 | 55 |
| PcbL5Ben | 141 | 39.72 | 54 |
| PcbL5Gam | 144 | 40.56 | 55 |
| PcbL5Nig | 141 | 39.72 | 54 |
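The caption describes clusters defined from pairwise SNP distance matrices at fixed cut-offs. The sketch below shows one common way to do this, under stated assumptions that the source does not confirm: strains are linked when their distance is at or below the cut-off, clusters are the resulting connected groups (single linkage), and a cluster needs at least two strains. The function names and the toy matrix are hypothetical. The percentages in the table are numerically consistent with the second column divided by the 355 strains (for example, 100 / 355 = 28.17 %), which is how the clustering rate is computed here.

```python
from itertools import combinations


def transmission_clusters(distances, cutoff):
    """Group strains whose pairwise SNP distance is <= cutoff (single linkage).

    `distances` is a symmetric square matrix (list of lists). Returns a list
    of clusters, each a list of strain indices, keeping clusters of size >= 2.
    """
    n = len(distances)
    parent = list(range(n))  # union-find forest, one node per strain

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if distances[i][j] <= cutoff:
            parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return [members for members in groups.values() if len(members) >= 2]


def clustering_rate(clusters, n_strains):
    """Percentage of strains that fall in any cluster."""
    clustered = sum(len(members) for members in clusters)
    return 100.0 * clustered / n_strains


# Toy 5-strain distance matrix (made-up values, NOT data from the study).
toy = [
    [0, 1, 9, 9, 9],
    [1, 0, 9, 9, 9],
    [9, 9, 0, 4, 9],
    [9, 9, 4, 0, 9],
    [9, 9, 9, 9, 0],
]

clusters_1 = transmission_clusters(toy, cutoff=1)
clusters_5 = transmission_clusters(toy, cutoff=5)
print(len(clusters_1), clustering_rate(clusters_1, len(toy)))
print(len(clusters_5), clustering_rate(clusters_5, len(toy)))
```

Because the distance matrix depends on which reference genome the reads were mapped to, running the same clustering on each matrix is what allows the per-reference rows of the table to be compared.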