| Literature DB >> 25879806 |
Kayo Okumura1, Masako Kato2, Teruo Kirikae3, Mitsunori Kayano4, Tohru Miyoshi-Akiyama5.
Abstract
BACKGROUND: Although Mycobacterium tuberculosis isolates are consisted of several different lineages and the epidemiology analyses are usually assessed relative to a particular reference genome, M. tuberculosis H37Rv, which might introduce some biased results. Those analyses are essentially based genome sequence information of M. tuberculosis and could be performed in sillico in theory, with whole genome sequence (WGS) data available in the databases and obtained by next generation sequencers (NGSs). As an approach to establish higher resolution methods for such analyses, whole genome sequences of the M. tuberculosis complexes (MTBCs) strains available on databases were aligned to construct virtual reference genome sequences called the consensus sequence (CS), and evaluated its feasibility in in sillico epidemiological analyses.Entities:
Mesh:
Year: 2015 PMID: 25879806 PMCID: PMC4425900 DOI: 10.1186/s12864-015-1368-9
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
strains used in this study
|
|
|
|
|
|
|
|---|---|---|---|---|---|
|
| CCDC5079 | CP001642.1 | China | Drug-susceptible isolate belonging to the Beijing family. | |
| CCDC5180 | CP001642.1 | China | Multidrug-resistant clinical isolate. | ||
| CDC1551 | AE000516.2 | USA | |||
| CTRI-2 | CP002992.1 | Russia | |||
| Erdman | AP012340.1 | USA | isolated from sputum samples of patients | ||
| F11 | CP000717.1 | South Africa | Predominant strain in South African epidemic | ||
| H37Ra | CP000611.1. | China | An avirulent strain derived from its virulent parent strain H37 | ||
| H37Rv | AL123456.2 | USA | |||
| KZN605 | CP001976.1 | South Africa | 1. between 932051 and 932052. 2. between 3479594 and 3459595 | Extensively drug-resistant clinical isolate | |
| KZN1435 | CP001658.1 | South Africa | 1. between 931985 and 931986. 2. between 3479865 and 3479866 | Multidrug-resistant clinical isolate | |
| KZN4207 | CP001662.1 | South Africa | 1. between 932007 and 932008. 2. between 3476553 and 3476554 | Drug-susceptible clinical isolate | |
| RGTB327 | CP003233.1 | India | isolated from sputum samples of patients | ||
| RGTB423 | CP003234.1 | India | isolated from sputum samples of patients | ||
|
| Mexico | CP002095.1 | Mexico | ||
| Moreau RDJ | AM412059.2 | Brazil | Brazilian vaccine strain | ||
| Pasteur | AM408590.1 | France | |||
| Tokyo 172 | AP010918.1 | Japan | |||
|
| GMO41182 | FR878060.1 | Gambia | ||
|
| CIPT 140010059 | HE572590.1 | France |
Efficacy of SNP calling using the consensus sequence or H37Rv as the reference genome by comparing the number of SNPs detected
|
| |||
|---|---|---|---|
|
|
|
|
|
|
| CCDC5079 | 632 | 198* |
| CCDC5180 | 378 | 70* | |
| CDC1551 | 543 | 41* | |
| CTRI-2 | 140 | 15* | |
| Erdman | 570 | 104* | |
| F11 | 343 | 129* | |
| H37Ra | 44 | 38* | |
| H37Rv | 12 | ||
| KZN605 | 16 | 9 | |
| KZN1435 | 12 | 2 | |
| KZN4207 | 26 | 0 | |
| RGTB327 | 1242 | 127* | |
| RGTB423 | 2410 | 145* | |
|
| Mexico | 13 | 0 |
| Moreau RDJ | 136 | 22* | |
| Pasteur | 20 | 0 | |
| Tokyo 172 | 56 | 76* | |
|
| GMO41182 | 876 | 152* |
|
| CIPT 140010059 | 24425* | 560 |
| Total SNP found in only one strain | 31882 | 1700 | |
| ii) group comparison | |||
| four BCG strains | 1040 | 52* | |
| three KZN strains | 121 | 15 | |
| All SNP at least found in one strain | 37589 | 3429 | |
Based on SNP calls using MUMmer [58], the number of SNPs called uniquely in individual strains, and in groups of BCG and KZN strains, was determined. *The number of SNPs detected using the H37Rv and consensus sequences as reference were compared for each strain (i) or group of strains (ii) using Fisher's exact test, with significant differences indicated with asterisks (p < 0.0001).
Figure 1Maximum-likelihood phylogenies based on whole genome and SNP concatenated sequence alignment. Phylogenetic trees based on whole genome sequences (a), SNP concatemers using CS as reference (b) and SNP concatemers using the H37Rv genome sequence as reference (c) were constructed using PhyML 3.0 [39]. Most probable trees were selected based on aLTR statics implemented in PhyML [39]. Isolates, clustered into different positions compared with the phylogenetic tree based on the whole genome sequences of M. tuberculosis strains RGTB327 and RGTB423, are indicated in the squares. For the KZN series, inversion-corrected sequences were used for the alignment and marked “_m”. aLTR statics values for each branch are shown. In Figure 1a, clusters of lineage 4 and 2 are indicated in the squares.
Figure 2Bayesian post-probable phylogenies based on whole genome and SNP concatenated sequence alignment. Description of data: Phylogenetic trees based on whole genome sequence (a), SNP concatemers using the consensus genome sequence as reference (b) and SNP concatemers using H37Rv genome as reference were constructed using BEAST [38]. All relevant parameters reached an effective sample size of >100, indicating good convergence of the chains. For each branch, 95% highest posterior density is shown with good support. Isolates, clustered into different positions compared with the phylogenetic tree based on the whole genome sequences of M. tuberculosis are indicated in the squares.
Distance analysis among phylogenic trees constructed based on maximum-likelihood and Bayesian MCMC methods
|
|
|
| ||
|---|---|---|---|---|
|
|
|
|
| |
| PhyML(aLRT) | 2.88E-01 | 2.88E-01 | 5.95E-01 | 5.95E-01 |
| PhyML(consel selected) | 2.89E-01 | 2.89E-01 | 5.96E-01 | 5.96E-01 |
| BEAST | 1.68E + 03 | 1.20E + 03 | 2.61E + 03 | 1.91E + 03 |
To quantify the branch score distance between trees, Robinson and Fould test [41] implemented in treedist in the Phylip package was utilized. Both of unrooted and rooted scores were calculated.
Computational time for each phylogenetic analysis
|
|
|
|
|
|---|---|---|---|
| Whole genome | 34h59min55sec | 2min49sec | 21h40min45sec |
| Consensus sequence | 4h50min9sec | 1min24sec | 1h47min24sec |
| H37Rv | 19h24min38sec | 2min37sec | 2h38min31sec |
For PhyML, two tree selection methods, 100 bootstrappings and aLRT were performed. In the bootstrap analyses, multithreading with 16 CPUs were utilized to reduce the computational time. For BEAST, 10 million samplings were performed for each analysis. Computational times were based on the log file of each analysis.
Comparison of Illumina read mapping efficacy using clinical isolates derived from different lineages using Bowtie2 and SAMtools
|
| ||||||||
|---|---|---|---|---|---|---|---|---|
|
|
|
|
| |||||
|
|
|
|
|
|
|
| ||
|
| ||||||||
| F092 | EAI | mapped | 681561 | 664376 | 684952 | 667250 | ||
| unmapped | 22261 | 39446 | 18870 | 36572 | ||||
| ratio (%) | 96.837 | 94.395 | 97.319 | 94.804 | 0.482 | 0.408 | ||
| J156 | EAI | mapped | 1680156 | 1650866 | 1689673 | 1658917 | ||
| unmapped | 40162 | 69452 | 30645 | 61401 | ||||
| ratio (%) | 97.665 | 95.963 | 98.219 | 96.431 | 0.553 | 0.468 | ||
| F038 | Haarlem, LAM, X etc. | mapped | 1024873 | 997301 | 1029625 | 1000714 | ||
| unmapped | 75113 | 102685 | 70361 | 99272 | ||||
| ratio (%) | 93.171 | 90.665 | 93.603 | 90.975 | 0.432 | 0.310 | ||
| F070 | Haarlem, LAM, X etc. | mapped | 858126 | 840921 | 861393 | 843463 | ||
| unmapped | 27822 | 45027 | 24555 | 42485 | ||||
| ratio (%) | 96.860 | 94.918 | 97.228 | 95.205 | 0.369 | 0.287 | ||
| J073 | Haarlem, LAM, X etc. | mapped | 1534315 | 1503494 | 1537891 | 1504256 | ||
| unmapped | 11979 | 42800 | 8403 | 42038 | ||||
| ratio (%) | 99.225 | 97.232 | 99.457 | 97.281 | 0.231 | 0.049 | ||
| J147 | Haarlem, LAM, X etc. | mapped | 847807 | 836269 | 849747 | 836489 | ||
| unmapped | 11775 | 23313 | 9835 | 23093 | ||||
| ratio (%) | 98.630 | 97.288 | 98.856 | 97.313 | 0.226 | 0.026 | ||
| F081 | other non-Beijing | mapped | 1004912 | 974425 | 1008107 | 976556 | ||
| unmapped | 43978 | 74465 | 40783 | 72334 | ||||
| ratio (%) | 95.807 | 92.901 | 96.112 | 93.104 | 0.305 | 0.203 | ||
| J020 | other non-Beijing | mapped | 1081365 | 1062537 | 1085065 | 1065704 | ||
| unmapped | 11293 | 30121 | 7593 | 26954 | ||||
| ratio (%) | 98.966 | 97.243 | 99.305 | 97.533 | 0.339 | 0.290 | ||
| J027 | other non-Beijing | mapped | 751633 | 741219 | 754861 | 744254 | ||
| unmapped | 5259 | 15673 | 2031 | 12638 | ||||
| ratio (%) | 99.305 | 97.929 | 99.732 | 98.330 | 0.426 | 0.401 | ||
| F022 | Ancestral Beijing | mapped | 1162270 | 1143243 | 1166545 | 1147960 | ||
| unmapped | 26600 | 45627 | 22325 | 40910 | ||||
| ratio (%) | 97.763 | 96.162 | 98.122 | 96.559 | 0.360 | 0.397 | ||
| J090 | Ancestral Beijing | mapped | 490815 | 484340 | 492424 | 486326 | ||
| unmapped | 5153 | 11628 | 3544 | 9642 | ||||
| ratio (%) | 98.961 | 97.655 | 99.285 | 98.056 | 0.324 | 0.400 | ||
| J002 | Ancestral Beijing | mapped | 736473 | 727044 | 739288 | 730539 | ||
| unmapped | 5757 | 15186 | 2942 | 11691 | ||||
| ratio (%) | 99.224 | 97.954 | 99.604 | 98.425 | 0.379 | 0.471 | ||
| J029 | Modern Beijing | mapped | 953792 | 936476 | 957539 | 941221 | ||
| unmapped | 10220 | 27536 | 6473 | 22791 | ||||
| ratio (%) | 98.940 | 97.144 | 99.329 | 97.636 | 0.389 | 0.492 | ||
| F076 | Modern Beijing | mapped | 532526 | 519473 | 534742 | 522431 | ||
| unmapped | 16374 | 29427 | 14158 | 26469 | ||||
| ratio (%) | 97.017 | 94.639 | 97.421 | 95.178 | 0.404 | 0.539 | ||
| J111 | Modern Beijing | mapped | 719693 | 708076 | 722895 | 712304 | ||
| unmapped | 14341 | 25958 | 11139 | 21730 | ||||
| ratio (%) | 98.046 | 96.464 | 98.482 | 97.040 | 0.436 | 0.576 | ||
| ii) Comparison of mappping frequency ratio (%) among the MTBC lineanges | ||||||||
| EAI | Haarlem, LAM, X etc. | other non-Beijing | ||||||
| Haarlem, LAM, X etc. | ns | - | - | |||||
| other non-Beijing | ns | ns | - | |||||
| Beijing | P < 0.05 | ns | ns | |||||
In this analysis CS based on 13 M. tuberculosis strains (Table 1) was used as the consensus sequence. i) After mapping with Bowtie2 [42] against H37Rv or CS, the idxstats command of SAMtools [43] was used to calculate the mapping efficacy (Table 5). In read mapping with Bowtie2, both of local and end-to-end mapping mode were performed, and the other parameters were set with default values. Significant differences in mapping frequencies were assessed using multiple comparisons of proportions tests [44]. For all isolates, the difference between H37Rv and CS as a reference differed significantly (p < 0.0001). For both mapping modes, the ratio of mapped to total reads was calculated, and these values used to calculate differences in mapping frequency between the consensus and H37Rv sequences by simple subtraction.
ii) Based on the difference in mapping frequency in 1), the mapping frequencies of MTBC lineages were compared using Mann–Whitney U tests. Combination of Beijing and EAI sequences showed the significan difference (p < 0.05) in mapping frequencies when compared relative to the consensus and H37Rv sequences, the latter belonging to the Haarlem, LAM, X etc. lineage (linage 4).
Comparison of Illumina read mapping efficacy using clinical isolates derived from different lineages using
|
|
|
|
|
| |||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
|
|
|
|
|
|
| ||||
| F092 | EAI | mapped | 676219 | 676007 | 674779 | 677079 | 676941 | 676114 | |||
| unmapped | 20275 | 20487 | 21715 | 19415 | 19553 | 20380 | |||||
| ratio (%) | 97.089 | 97.059 | 96.882 | 97.212 | 97.193 | 97.074 | 0.123 | 0.134 | 0.192 | ||
| J156 | EAI | mapped | 1656675 | 1656191 | 1652024 | 1661496 | 1660887 | 1657869 | |||
| unmapped | 37151 | 37635 | 41802 | 32330 | 32939 | 35957 | |||||
| ratio (%) | 97.807 | 97.778 | 97.532 | 98.091 | 98.055 | 97.877 | 0.285 | 0.277 | 0.345 | ||
| F038 | Haarlem, LAM, X etc. | mapped | 985717 | 985256 | 978713 | 986844 | 986318 | 980541 | |||
| unmapped | 43761 | 44222 | 50765 | 42634 | 43160 | 48937 | |||||
| ratio (%) | 95.749 | 95.704 | 95.069 | 95.859 | 95.808 | 95.246 | 0.109 | 0.103 | 0.178 | ||
| F070 | Haarlem, LAM, X etc. | mapped | 847048 | 846879 | 844774 | 847486 | 847257 | 845485 | |||
| unmapped | 21212 | 21381 | 23486 | 20774 | 21003 | 22775 | |||||
| ratio (%) | 97.557 | 97.537 | 97.295 | 97.607 | 97.581 | 97.377 | 0.05 | 0.044 | 0.082 | ||
| J073 | Haarlem, LAM, X etc. | mapped | 1511361 | 1511205 | 1508937 | 1512926 | 1512725 | 1511328 | |||
| unmapped | 25005 | 25211 | 27479 | 23490 | 23691 | 25088 | |||||
| ratio (%) | 98.372 | 98.359 | 98.211 | 98.471 | 98.458 | 98.367 | 0.099 | 0.099 | 0.156 | ||
| J147 | Haarlem, LAM, X etc. | mapped | 835484 | 835319 | 834077 | 835192 | 834976 | 834227 | |||
| unmapped | 14694 | 14859 | 16101 | 14986 | 15202 | 15951 | |||||
| ratio (%) | 98.272 | 98.252 | 98.106 | 98.237 | 98.212 | 98.124 | −0.034 | −0.04 | 0.018 | ||
| F081 | other non-Beijing | mapped | 984775 | 984303 | 981542 | 986321 | 986079 | 983479 | |||
| unmapped | 31067 | 31539 | 34300 | 29521 | 29763 | 32363 | |||||
| ratio (%) | 96.942 | 96.895 | 96.623 | 97.094 | 97.07 | 96.814 | 0.152 | 0.175 | 0.191 | ||
| J020 | other non-Beijing | mapped | 1061748 | 1061460 | 1059770 | 1063613 | 1063346 | 1062167 | |||
| unmapped | 22918 | 23206 | 24896 | 21053 | 21320 | 22499 | |||||
| ratio (%) | 97.887 | 97.861 | 97.705 | 98.059 | 98.034 | 97.926 | 0.172 | 0.174 | 0.221 | ||
| J027 | other | mapped | 741045 | 740904 | 739801 | 742516 | 742329 | 741694 | |||
| non-Beijing | unmapped | 13721 | 13862 | 14965 | 12250 | 12437 | 13072 | ||||
| ratio (%) | 98.182 | 98.163 | 98.017 | 98.377 | 98.352 | 98.268 | 0.195 | 0.189 | 0.251 | ||
| F022 | Ancestral Beijing | mapped | 1140373 | 1140107 | 1137877 | 1145147 | 1144862 | 1143607 | |||
| unmapped | 32673 | 32939 | 35169 | 27899 | 28184 | 29439 | |||||
| ratio (%) | 97.215 | 97.192 | 97.002 | 97.622 | 97.597 | 97.49 | 0.407 | 0.405 | 0.488 | ||
| J090 | Ancestral Beijing | mapped | 2087545 | 2086983 | 2082777 | 2095411 | 2094879 | 2092249 | |||
| unmapped | 47551 | 48113 | 52319 | 39685 | 40217 | 42847 | |||||
| ratio (%) | 97.773 | 97.747 | 97.55 | 98.141 | 98.116 | 97.993 | 0.368 | 0.37 | 0.444 | ||
| J002 | Ancestral Beijing | mapped | 725501 | 725308 | 724182 | 727822 | 727702 | 727147 | |||
| unmapped | 13427 | 13620 | 14746 | 11106 | 11226 | 11781 | |||||
| ratio (%) | 98.183 | 98.157 | 98.004 | 98.497 | 98.481 | 98.406 | 0.314 | 0.324 | 0.401 | ||
| J029 | Modern Beijing | mapped | 935765 | 935598 | 934129 | 939368 | 939185 | 938425 | |||
| unmapped | 21607 | 21774 | 23243 | 18004 | 18187 | 18947 | |||||
| ratio (%) | 97.743 | 97.726 | 97.572 | 98.119 | 98.1 | 98.021 | 0.376 | 0.375 | 0.449 | ||
| F076 | Modern Beijing | mapped | 523546 | 523438 | 522478 | 526300 | 526150 | 525618 | |||
| unmapped | 17480 | 17588 | 18548 | 14726 | 14876 | 15408 | |||||
| ratio (%) | 96.769 | 96.749 | 96.572 | 97.278 | 97.25 | 97.152 | 0.509 | 0.501 | 0.58 | ||
| J111 | Modern Beijing | mapped | 703968 | 703761 | 702412 | 708028 | 707765 | 707024 | |||
| unmapped | 17244 | 17451 | 18800 | 13184 | 13447 | 14188 | |||||
| ratio (%) | 97.609 | 97.58 | 97.393 | 98.172 | 98.135 | 98.033 | 0.563 | 0.555 | 0.639 | ||
| ii) Comparison of mappping frequency ratio (%) among the MTBC lineanges | |||||||||||
| EAI | HaarlemLAM, X etc. | other non-Beijing | |||||||||
| Haarlem,LAM, X etc. | ns | - | - | ||||||||
| non-Beijing | ns | ns | - | ||||||||
| Beijing | P<0.05 | P<0.01 | P<0.05 | ||||||||
In this analysis CS based on 13 M. tuberculosis strains (Table 1) was used as the consensus sequence. i) The effects on mapping efficacy were tested for three combinations of parameters: mismatch cost, insertion cost, deletion cost, matching length and similarity. *Mapping stringency was defined as Ambiguous, with frequencies of mismatch cost, insertion cost, deletion cost, matching length and similarity of 2, 2, 2, 0.5, and 0.8, respectively; Medium, with frequencies of 2, 3, 3, 0.5, and 0.8, respectively; and Strict, with frequencies of 3, 3, 3, 0.5, and 0.95, respectively. Significant differences in mapping frequencies were assessed using multiple comparisons of proportions tests [44]. For all isolates, the difference between H37Rv and CS as a reference differed significantly (p < 0.0001). For each stringency setting, the ratio of mapped to total reads was calculated, and these values used to calculate differences in mapping frequency between the consensus and H37Rv sequences by simple subtraction.
ii) Based on the difference in mapping frequency in 1), the mapping frequencies of MTBC lineages were compared using Mann–Whitney U tests. The Haarlem, LAM, X and Beijing sequences showed the greatest difference (p < 0.01) in mapping frequencies when compared relative to the consensus and H37Rv sequences.
Figure 3Virtual VNTR profile of each genome and the similarity search in MIRU-VNTR. The number of each VNTR marker in each genome was analyzed based on primer sequences [45]. These numbers were input for VNTR analyses by MIRU-VNTR (http://www.miru-vntrplus.org/). The closest match is presented immediately below the line for the corresponding isolate. Markers differing in number for the analyzed strain and its closest match are indicated in bold boxes.
Figure 4Virtual spoligotyping of each strain and similarity search at MIRU-VNTR. Each strain's genome was analyzed for the presence of absence of each spoligo spacer based on primer sequences [50]. The presence or absence of each spacer input for spoligotyping analysis at MIRU-VNTR (http://www.miru-vntrplus.org/). The closest match by similarity search is shown immediately below the line for the corresponding isolate. Markers differing in presence or absence in the analyzed strain and its closest match are indicated in bold boxes. * reported as Beijing type [56].
Virtual analyses of LSP and Beijing typing
|
|
|
|
| |
|---|---|---|---|---|
|
|
| |||
|
| CCDC5079 | 2 | No | - |
| CCDC5180 | 2 | Yes | Modern | |
| CDC1551 | 4 | No | - | |
| CTRI-2 | 4 | No | - | |
| Erdman | 4 | No | - | |
| F11 | 4 | No | - | |
| H37Ra | 4 | No | - | |
| H37Rv | 4 | No | - | |
| KZN605 | 4 | No | - | |
| KZN1435 | 4 | No | - | |
| KZN4207 | 4 | No | - | |
| RGTB327 | 4 | No | - | |
| RGTB423 | 4 | No | - | |
|
| Mexico | 4 | No | - |
| Moreau RDJ | 4 | No | - | |
| Pasteur | 4 | No | - | |
| Tokyo 172 | 4 | No | - | |
|
| GMO41182 | 6 | No | - |
|
| CIPT 140010059 | 2 | No | - |
LSP analysis [8] and Beijing typing [55] of target sequences were performed in silico using the indicated primers in the articles.