Jérôme Azé, Christophe Sola, Jian Zhang, Florian Lafosse-Marin, Memona Yasmin, Rubina Siddiqui, Kristin Kremer, Dick van Soolingen, Guislaine Refrégier.
Abstract
Infra-species taxonomy is a prerequisite for comparing features such as virulence among pathogen lineages. Mycobacterium tuberculosis complex taxonomy has evolved rapidly over the last 20 years through intensive clinical isolation, advances in sequencing, and the description of fast-evolving loci (CRISPR and MIRU-VNTR). Online tools to describe new isolates have been set up based on known diversity in either CRISPR-based patterns (spoligotypes) or MIRU-VNTR profiles. The underlying taxonomies are largely concordant but use different names and offer different depths. The objectives of this study were 1) to make explicit the consensus that exists between the alternative taxonomies, and 2) to provide an online tool to ease classification of new isolates. Genotyping (24-VNTR, 43-spacer spoligotyping, IS6110-RFLP) was undertaken for 3,454 clinical isolates from the Netherlands (2004-2008). The resulting database was enlarged with African isolates to include most human tuberculosis diversity. Assignments were obtained using TB-Lineage, MIRU-VNTRplus, SITVITWEB and an algorithm from Borile et al. By identifying the recurrent concordances between the alternative taxonomies, we proposed a consensus including 22 sublineages. Original and consensus assignments of all isolates in the database were subsequently fed into an ensemble learning approach based on the machine learning tool Weka to derive a classification scheme. All assignments were reproduced with very good sensitivities and specificities. When applied to independent datasets, the scheme was able to suggest new sublineages such as pseudo-Beijing. This Lineage Prediction tool, which works on 15-MIRU, 24-VNTR and spoligotype data, is available on the web interface "TBminer." Another section of this website helps summarize key molecular epidemiological data, easing tuberculosis surveillance.
Altogether, we successfully used machine learning on a large dataset to set up and make available the first consensus taxonomy for the human Mycobacterium tuberculosis complex. Additional developments using SNPs will help stabilize it.
Year: 2015 PMID: 26154264 PMCID: PMC4496040 DOI: 10.1371/journal.pone.0130912
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Correspondence table between the different M. tuberculosis taxonomies.
| TBlineage | MIRU-VNTR | SITVITWEB | Borile-AP | Consensus Expert |
|---|---|---|---|---|
| Lineage 6 (West African 2) | West African 2 | AFRI_1 | Afri1 | |
| | | AFRI | | |
| Lineage 5 (West African 1) | West African 1 | AFRI_2 | Afri2-3 | |
| | | AFRI_3 | | |
| Animal strains | Bovis | BOVIS | bovis | |
| | | MICROTII, PINI | Pin-Mic | |
| | | CAP | Cap | |
| Lineage 1 (Indo Oceanic) | EAI | EAI1_SOM | EAI1 | |
| | | EAI2_MANILLA, NTB | EAI2 | |
| | | EAI3_IND | EAI3-5 | |
| | | EAI4_VNM & EAI5 | EAI | |
| | | EAI6, EAI7 | EAI6 | |
| Lineage 2 (East Asia) | Beijing | BEIJING | Beij | |
| | | BEIJING-LIKE | | |
| Lineage 3 (India and East Africa) | Delhi/CAS | CAS1_DELHI | CAS | |
| | | CAS1_KILI | | |
| | | CAS2 | | |
| Lineage 4 (Euro-American) | Ghana | T1 | T1a-T1b-T1c | in L4 |
| | UgandaI-II | T2, T2_UGANDA | T2 | |
| | | EAST_MED1 | T(T1-H-CAM) | in L4 |
| | | LAM3_S | | |
| | ? | T3 | ? | in L4 |
| | ? | T4_CEU | T4 | |
| | ? | T5_MAD2 | T5 | |
| | H37Rv | H37Rv | ? | |
| | TUR | LAM7_TUR | ? | |
| | | T1 | ? | |
| | URAL | H4 (renamed Ural1) | Ural | |
| | New-1 | H4 (renamed Ural2) | ? | |
| | S | S | S | |
| | Cameroon | LAM10_CAM | T-T1 | |
| | Haarlem | H1, H2 | H1-2 | |
| | | H3, H3-T3 | H3 | |
| | LAM | LAM1, LAM2, LAM5 | LAM5-2-1 | |
| | | LAM3, 4, 6, 8 | LAM3 | |
| | | LAM9, 11, 12 | LAM9-11 | |
| | | T5_RUS1 | T(T1-H-CAM) | |
| | | T5 | ? | |
| | X | X2 | X2 | |
| | | H1 | H1-2 | |
| | X, Haarlem | X1, X3 | X1-3 | |
| ? | ? | MANU1, MANU2, ZERO | Manu | |
Items in italics were added subsequently according to findings in complementary analyses. Text in parentheses indicates synonyms. Complete sublineages in parentheses indicate imprecise correspondence. “?” indicates a hypothesis with no actual proof.
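A correspondence table of this kind lends itself to a simple lookup. The sketch below maps a SITVITWEB label to the TBlineage main lineage using a handful of rows from the table; it is a hypothetical helper, not part of TBminer, and the names `SITVITWEB_TO_TBLINEAGE` and `main_lineage` are assumptions made for illustration.

```python
# Hypothetical lookup built from a few rows of the correspondence table above.
# It is deliberately incomplete; labels that are not listed return "?".
SITVITWEB_TO_TBLINEAGE = {
    "AFRI_1": "Lineage 6",
    "AFRI_2": "Lineage 5",
    "BOVIS": "Animal strains",
    "EAI1_SOM": "Lineage 1",
    "BEIJING": "Lineage 2",
    "CAS1_KILI": "Lineage 3",
    "T1": "Lineage 4",
}


def main_lineage(sitvitweb_label: str) -> str:
    """Return the TBlineage name for a SITVITWEB label, or "?" if unlisted."""
    key = sitvitweb_label.strip().upper()
    return SITVITWEB_TO_TBLINEAGE.get(key, "?")


if __name__ == "__main__":
    for label in ("BEIJING", "cas1_kili", "UNLISTED"):
        print(label, "->", main_lineage(label))
```

Returning "?" for unlisted labels mirrors the table's own convention for unproven or unknown correspondences.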
Fig 1. Relative prevalence of main M. tuberculosis complex lineages in the Netherlands (2005–2008).
Major clusters of the 2004–2008 Netherlands RIVM collection (n≥10).
| 24-VNTR Cluster ID | n | Different spoligotype patterns within the same 24-VNTR cluster | SIT | Sublineages (SITVITWEB classification) | Year of first isolation | ID of eldest isolates with corresponding genotype |
|---|---|---|---|---|---|---|
| | | νν○ννννννννννννννννν○○○○νννννννν○○○○ννννννν | 20 | | 2004 | NLA000400263 |
| 1 | 64 | νν○ννννν○ννννννννννν○○○○νννννννν○○○○ννν○ννν | New | LAM 1 | 2005 | NLA000500735 |
| | | νν○ν○ννννννννννννννν○○○○νννννννν○○○○ννννννν | 729 | | 2006 | NLA000601675 |
| | | ννννννννννννννννννννννννν○○○○○○ν○○○○ννν○ννν | 62 | H1 | 2004 | NLA000400246 |
| 2 | 53 | ννννννννννννννννννννννννν○○○○○○ν○○○○ννννννν | 47 | | 2005 | NLA000500437 |
| | | ○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○ | 2669 | U | 2006 | NLA000600009 |
| 3 | 33 | ννννννν○○○νννννννννννννννννννννν○○○○ννν○ννν | New | U | 2004 | NLA000400201 |
| 4 | 28 | ννννννννννννννννννν○νννννννννννν○○○○ννν○ννν | 736 | T2 | 2004 | NLA000400425 |
| | | ννννννννννννννννννν○○○○○○ννννννν○○○○ννν○ννν | New | | 2005 | NLA000501826 |
| 5 | 22 | νννννννννννννννννννν○○○○νννννννν○○○○ννννννν | 42 | LAM9 | 2004 | NLA000400150 |
| | | νννννννννννννννννννννννννννννννν○○○○ννννννν | 53 | T | 2005 | NLA000500775 |
| 6 | 19 | ννν○○○○νν○ννννννννν○○○○○○○○○○○○○○○○νννννννν | 21 | CAS1_KILI | 2004 | NLA000401265 |
| | | ννν○○○○νννννννννννν○○○○○○○○○○○○○○○○νννννννν | 22 | | 2005 | NLA000500746 |
| | | νννννννννννν○νννννν○νν○ννννννννν○○○○ννννννν | 1227 | | 2004 | NLA000400237 |
| 7 | 18 | ννννννννννννννννννν○νν○ννννννννν○○○○ννννννν | 58 | T5_MAD2 | 2004 | NLA000400972 |
| | | νννννννννννννννννννννν○ννννννννν○○○○ννννννν | 44 | | 2004 | NLA000401032 |
| 8 | 17 | νν○νννν○○○○○○○○○○○○○○○○○○ννν○○○○ν○ννννννννν | 89 | EAI2_NTB | 2004 | NLA000400077 |
| 9 | 15 | ννννννννννννννννννννννννν○ννν○νν○○○○ννννννν | 1558 | T1 | 2004 | NLA000400231 |
| 10 | 14 | ○○○○○○○○○○○○○○○○○○○○○○○○ν○○○○○○ν○○○○ννννννν | 2 | H2 | 2004 | NLA000400112 |
| | | ννννννννννννννννννν○○○○○ν○○ννννν○○○○ννννννν | 41 | | 2004 | NLA000401211 |
| | | νν○νννννννννννννννν○○○○○ν○○ννννν○○○○ννννννν | 930 | | 2005 | NLA000500774 |
| 11 | 14 | ννννννννννννννννννν○○○○○ν○○ννννν○○○○ννν○ννν | 1261 | TUR | 2005 | NLA000501593 |
| | | νννννννννννν○νννννν○○○○○ν○○ννννν○○○○ννννννν | 367 | | 2006 | NLA000601569 |
| | | ννν○○νννννννννννννν○○○○○ν○○ννννν○○○○ννννννν | New | | 2007 | NLA000701171 |
| | | ννν○○○○ννννννννννννννν○○○○○○○○○○○○○νννννννν | 203 | | 2004 | NLA000401787 |
| 12 | 13 | ννν○○○○ννννννννννννννν○○○○○○○○○○ν○○νννννννν | New | CAS | 2005 | NLA000500783 |
| | | ννν○○○○νννννννννννννν○○○○○○○○○○○○○○○ννννννν | 1949 | | 2008 | NLA000800421 |
| 13 | 12 | ννν○○○○ννννννννννννννν○○○○○○○○○○○○ννν○ννννν | 289 | CAS1_DELHI | 2004 | NLA000400590 |
| | | ννν○○○○ννννννννννννννν○○○○○○○○○○○○νν○○ννννν | 25 | | 2005 | NLA000500524 |
| | | ννννννννν○○○○○○○○○○ννννννννννννν○○○○ννννννν | 149 | | 2004 | NLA000400548 |
| 14 | 11 | ννννννννν○○○○○○○○○○ννννννννννννν○○○○νννν○νν | New | T3_ETH | 2006 | NLA000600430 |
| | | ννννννννν○○○○○○○○○○ννννννννννννν○○○○ννν○ννν | 345 | | 2008 | NLA000800132 |
| 15 | 11 | ○○○○○○○○○○○○○○○○○○○○○○○○νννννννν○○○○ννννννν | 1280 | T1 | 2004 | NLA000400046 |
| 16 | 11 | νν○○○○○○νννννννννννν○○○○νν○○○○νν○○○○ννννννν | 1607 | LAM11_ZWE | 2005 | NLA000500458 |
| 17 | 10 | ννν○○○○○○○○○ννννν○νννννννννννννν○○○○ννννννν | 92 | X3 | 2004 | NLA000400304 |
| | | ννννννννννννννννννννννν○○○○○○○○○○○○○○○○○○○○ | 786 | | 2004 | NLA000401283 |
| | | νννννννννννννννννννννννννννννν○○○○○○○○○○○○○ | 237 | | 2005 | NLA000500740 |
| 18 | 10 | νννννννννν○○○○ννννννν○νννννννν○○○○○○○○○○○○○ | 465 | U | 2005 | NLA000500790 |
| | | νννννννννν○○○○ννν○ννν○○○○○○○○○○○○○○○○○○○○○○ | New | | 2005 | NLA000501258 |
| | | νννννννννννννννννννν○○○○○○○○○○○○○○○○○○○○○○○ | 402 | | 2006 | NLA000601923 |
| | | νννννννννννννννννννννννν○○○○○○○○○○○○○○○○○○○ | 46 | | 2008 | NLA000801472 |
| | | ○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○ | 2669 | | 2008 | NLA000801594 |
| | | νννννννννννν○ννννννννννννννννν○ν○○○○ννννννν | 36 | H3-T3 | 2004 | NLA000401512 |
| | | νννννννννννννννννννννννννννννν○ν○○○○ννννννν | 50 | H3 | 2005 | NLA000500512 |
| 19 | 10 | ννννννννννννννννννννννννν○○○○○○ν○○○○ννννννν | 47 | H1 | 2005 | NLA000501842 |
| | | ννννννννννννννννννν○○ννννννννν○ν○○○○ννννννν | New | H3 | 2006 | NLA000601244 |
| | | νννννννννννννννννν○ννννννννννν○ν○○○○ννννννν | 183 | H3 | 2006 | NLA000601580 |
| | | ν○○νννννννννννννννννννννν○○○○○○ν○○○○ννννννν | 1652 | H1 | 2008 | NLA000801391 |
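24-VNTR cluster ID numbers were assigned according to cluster size (no. 1 being the largest). Isolate IDs are those given in S1 Table. SIT = Spoligotype International Type. In the spoligotype patterns, ν marks a spacer that is present and ○ a spacer that is absent.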
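The numbering rule described in this footnote (largest cluster first, clusters of at least 10 isolates only) can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `number_clusters`, the representation of a 24-VNTR profile as a string, and the toy isolate IDs are all assumptions.

```python
from collections import Counter


def number_clusters(profiles, min_size=10):
    """Assign cluster IDs by decreasing size to identical 24-VNTR profiles.

    `profiles` maps an isolate ID to its 24-VNTR profile (any hashable value).
    Only profiles shared by at least `min_size` isolates form a numbered
    cluster. Clusters of equal size keep first-seen order.
    """
    counts = Counter(profiles.values())
    ranked = [p for p, n in counts.most_common() if n >= min_size]
    return {profile: rank for rank, profile in enumerate(ranked, start=1)}


if __name__ == "__main__":
    # Toy data with a lowered threshold so the example stays small.
    toy = {"iso1": "A", "iso2": "A", "iso3": "B",
           "iso4": "A", "iso5": "B", "iso6": "C"}
    print(number_clusters(toy, min_size=2))
```

Profile "C" is carried by a single isolate in the toy data, so it receives no cluster ID.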
Fig 2. Concordance of existing classifications with the consensus classification proposed in this study.
Accuracy of different induction algorithms on the training dataset using 10-fold stratified cross-validation.
| Input data | Predicted classification | Number of lineages/sublineages | Median lineage size | J48 | JRip | NB | PART | RF | Vote-5 | Vote-10 |
|---|---|---|---|---|---|---|---|---|---|---|
| spoligo | TB-Lineage (Pred1) | 7 | 213 | 99.5 | 99.7 | 98.5 | 99.6 | 99.7 | 99.8 | 99.8 |
| spoligo | Borile (Pred3) | 28 | 51 | 97.2 | 97.8 | 87.8 | 97.5 | 97.8 | 98.3 | 98.5 |
| spoligo | SITVITWEB-expert (Pred4) | 52 | 28 | 96.7 | 96.6 | 89.3 | 96.7 | 97.6 | 97.7 | 97.9 |
| 24-VNTR | MIRU-VNTR | 18 | 99 | 88.3 | 88.2 | 85.1 | 89 | 91.9 | 91 | 91.4 |
| 24-VNTR | Expert-consensus (Pred5) | 24 | 45 | 86.6 | 80.9 | 80.5 | 87.1 | 90.2 | 88.6 | 88.6 |
| 15-VNTR | MIRU-VNTR | 18 | 99 | 88.2 | 86.1 | 84 | 88.5 | 91 | 91.8 | 92 |
| 15-VNTR | Expert-consensus (Pred5bis) | 24 | 45 | 84.6 | 78.8 | 79.4 | 85.3 | 88.6 | 90.3 | 90.3 |
NB: Naïve Bayes. RF: Random Forest. Vote-5: Vote including the 5 algorithms shown here (from J48 to RF). Vote-10: Vote including the 5 algorithms and their meta-bagging derivatives.
*: algorithm used in the Lineage Prediction tool on the TBminer website. For details on the algorithm, see Materials and Methods. Font size underlines performance.
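The idea behind the Vote columns can be illustrated with a plain majority vote over the labels returned by several classifiers for one isolate. This is only a sketch under assumptions: Weka's Vote meta-classifier supports several combination rules, the record above does not state which one was used, and the function name `majority_vote` and the example labels are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most classifiers.

    `predictions` holds one label per base classifier (for example J48, JRip,
    NB, PART and RF). On a tie, the label that appears first in the list wins.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


if __name__ == "__main__":
    # Hypothetical outputs of five base classifiers for a single isolate.
    print(majority_vote(["LAM", "LAM", "T1", "LAM", "H3"]))
```

Breaking ties by list order keeps the result deterministic; a probability-averaging rule would need class probabilities from each base classifier instead of hard labels.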
Fig 3. TBminer Lineage Prediction tool: the output file.
Fig 4. TBminer Prediction tool performance on the MIRU-VNTRplus database.
A. Concordance between TBminer Pred2_Miru-Vntr and MIRU-VNTRplus assignments. B. Concordance between Pred6 and manual expert assignment accounting for original labels.
Fig 5. TBminer Prediction tool performance on a Pakistani sample.
The Consensus Lineage Prediction tool of TBminer was compared to the expert assignment on an independent dataset from Pakistan.
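A comparison of this kind reduces to counting agreements between two label lists. The sketch below computes overall concordance and one-vs-rest sensitivity and specificity for a single lineage; the function names and the toy labels are assumptions for illustration and are not taken from the study.

```python
def concordance(predicted, expert):
    """Fraction of isolates given the same label by both sources."""
    if len(predicted) != len(expert) or not expert:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(p == e for p, e in zip(predicted, expert)) / len(expert)


def sensitivity_specificity(predicted, expert, lineage):
    """One-vs-rest sensitivity and specificity for one lineage.

    The expert label is treated as the reference. A rate whose denominator
    is zero is returned as None.
    """
    tp = fn = fp = tn = 0
    for p, e in zip(predicted, expert):
        if e == lineage:
            if p == lineage:
                tp += 1
            else:
                fn += 1
        elif p == lineage:
            fp += 1
        else:
            tn += 1
    sens = tp / (tp + fn) if tp + fn else None
    spec = tn / (tn + fp) if tn + fp else None
    return sens, spec


if __name__ == "__main__":
    # Toy labels, not data from the paper.
    pred = ["CAS", "CAS", "EAI", "Beijing"]
    ref = ["CAS", "EAI", "EAI", "Beijing"]
    print(concordance(pred, ref))
    print(sensitivity_specificity(pred, ref, "CAS"))
```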
Concordance between SNP classification and the newly proposed automated consensus tool on a set of isolates with conflicting assignments in existing taxonomies.
| SIT | Spoligotype pattern | Classical lineage or sublineage (SITVITWEB) | SNP lineage or sublineage (Abadia et al.) | TBminer Consensus Lineage Prediction | Concordance between SNP and Consensus Lineage Prediction | N |
|---|---|---|---|---|---|---|
| 254 | νννννννννννννν○○○○○○○○○○νννννννν○○○○ννννννν | T5_RUS1 | LAM | Lineage4_LAM | | 9 |
| 316 | νννννννννννννννννννννννν○○○○○○○ν○○○○ννν○ννν | H3 | T2 | Lineage4_New1 (Ural2)? | | 2 |
| 316 | νννννννννννννννννννννννν○○○○○○○ν○○○○ννν○ννν | H3 | T2 | Lineage4-T2?_H? | | 4 |
| 1531 | ννννννννννννννννν○νννννννννννννν○○○○○○○νννν | U | X | Lineage4_X | | 3 |
| 134 | νννννννννννννννννννννννννννννν○ν○○○○νν○○ννν | H3 | X | Lineage4_H3 | - | 2 |
| 78 | νννννννννννννννννννννννννννννννν○○○○ννν○○νν | T1-T2 | Tur | Lineage4 | | 1 |
| 1274 | ννννννννννννννννννννννν○○○○○○○○○○○○○○○○νννν | U | H | Lineage4_H? | | 1 |
The only genotypic data available to perform classification were spoligotype patterns.
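Spoligotype patterns like those in the table are commonly exchanged as 15-digit octal codes: the first 42 spacers are read as 14 binary triplets, and the 43rd spacer contributes a final digit of 0 or 1. The sketch below performs that conversion, assuming that ν stands for a present spacer and ○ for an absent one as rendered on this page; the function name `spoligotype_to_octal` is hypothetical.

```python
def spoligotype_to_octal(pattern, present="ν", absent="○"):
    """Convert a 43-spacer spoligotype pattern to its 15-digit octal code."""
    if len(pattern) != 43 or set(pattern) - {present, absent}:
        raise ValueError("expected 43 symbols drawn from the two given marks")
    bits = "".join("1" if symbol == present else "0" for symbol in pattern)
    # Spacers 1-42: fourteen triplets, each read as one octal digit.
    digits = [str(int(bits[i:i + 3], 2)) for i in range(0, 42, 3)]
    # Spacer 43: a single trailing digit, 0 or 1.
    digits.append(bits[42])
    return "".join(digits)


if __name__ == "__main__":
    # Pattern with spacers 1-32 and 37-43 present and 33-36 absent.
    pattern = "ν" * 32 + "○" * 4 + "ν" * 7
    print(spoligotype_to_octal(pattern))  # prints 777777777760771
```

Passing the presence and absence marks as parameters lets the same function handle other renderings of the pattern, such as filled and empty squares.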
Fig 6. Approach for consensus building between conflicting taxonomies.