Alexandre Drouin, Sébastien Giguère, Maxime Déraspe, Mario Marchand, Michael Tyers, Vivian G Loo, Anne-Marie Bourgault, François Laviolette, Jacques Corbeil.
Abstract
BACKGROUND: The identification of genomic biomarkers is a key step towards improving diagnostic tests and therapies. We present a reference-free method for this task that relies on a k-mer representation of genomes and a machine learning algorithm that produces intelligible models. The method is computationally scalable and well-suited for whole genome sequencing studies.
Keywords: Antibiotic resistance; Bacteria; Biomarker discovery; Genomics; Machine learning
Year: 2016 PMID: 27671088 PMCID: PMC5037627 DOI: 10.1186/s12864-016-2889-6
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Dataset summary: distribution of resistant and sensitive isolates in each dataset
Feature selection: Average testing set error rate and sparsity (in parentheses) for 10 random partitions of the data
| Dataset | SCM | χ²+CART | χ²+L1SVM | χ²+L2SVM | χ²+SCM | Baseline |
|---|---|---|---|---|---|---|
| | | | | | | |
| Azithromycin | | 0.086 (7.2) | 0.064 (20326.0) | 0.056 (10^6) | 0.075 (3.0) | 0.446 |
| Ceftriaxone | | 0.117 (6.8) | 0.087 (8114.1) | 0.102 (10^6) | 0.111 (3.2) | 0.306 |
| Clarithromycin | | 0.070 (8.0) | 0.062 (36686.1) | 0.059 (10^6) | 0.069 (3.5) | 0.446 |
| Clindamycin | 0.021 (1.4) | 0.011 (2.0) | 0.009 (598.2) | 0.021 (10^6) | | 0.136 |
| Moxifloxacin | | | | 0.048 (10^6) | 0.021 (1.1) | 0.390 |
| | | | | | | |
| Ethambutol | 0.179 (1.4) | 0.185 (1.9) | | 0.221 (10^6) | 0.174 (3.2) | 0.351 |
| Isoniazid | 0.021 (1.0) | 0.021 (1.1) | | 0.125 (10^6) | 0.021 (1.2) | 0.421 |
| Pyrazinamide | | 0.371 (4.4) | 0.353 (481.2) | 0.342 (10^6) | 0.366 (5.8) | 0.347 |
| Rifampicin | 0.031 (1.4) | 0.031 (1.5) | 0.031 (130.0) | 0.196 (10^6) | | 0.452 |
| Streptomycin | 0.050 (1.0) | 0.052 (1.6) | | 0.137 (10^6) | 0.050 (2.1) | 0.435 |
| | | | | | | |
| Amikacin | 0.175 (4.9) | 0.206 (14.1) | 0.187 (11514.6) | | | 0.216 |
| Doripenem | 0.270 (1.4) | | | 0.275 (10^6) | 0.307 (8.5) | 0.359 |
| Levofloxacin | | 0.076 (1.0) | 0.085 (148.9) | 0.212 (10^6) | 0.083 (3.5) | 0.463 |
| Meropenem | 0.267 (1.6) | | 0.328 (5368.5) | 0.327 (10^6) | 0.331 (9.1) | 0.404 |
| | | | | | | |
| Benzylpenicillin | 0.013 (1.1) | 0.012 (2.3) | | 0.013 (10^6) | 0.013 (1.3) | 0.073 |
| Erythromycin | | 0.047 (3.8) | 0.041 (328.8) | 0.042 (10^6) | 0.041 (5.1) | 0.142 |
| Tetracycline | 0.031 (1.1) | | 0.032 (1108.5) | 0.037 (10^6) | 0.033 (2.2) | 0.106 |
Results are shown for the SCM, which uses the entire feature space, and the feature selection-based methods: χ²+CART, χ²+L1SVM, χ²+L2SVM and χ²+SCM. The baseline method predicts the most abundant class in the training set. The smallest error rates are in bold
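The baseline and the averaging over random partitions described in this table can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the `test_fraction` value and the seed are assumptions, since the split proportions are not given here.

```python
import random
from collections import Counter
from typing import Sequence


def majority_baseline_error(train: Sequence[int], test: Sequence[int]) -> float:
    """Error rate on `test` when always predicting the most common training label."""
    majority = Counter(train).most_common(1)[0][0]
    return sum(1 for y in test if y != majority) / len(test)


def mean_baseline_error(labels: Sequence[int], n_partitions: int = 10,
                        test_fraction: float = 0.2, seed: int = 0) -> float:
    """Average baseline error over several random train/test partitions.

    `test_fraction` and `seed` are illustrative assumptions.
    """
    rng = random.Random(seed)
    n_test = max(1, round(test_fraction * len(labels)))
    errors = []
    for _ in range(n_partitions):
        shuffled = list(labels)
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        errors.append(majority_baseline_error(train, test))
    return sum(errors) / len(errors)
```

Because the baseline ignores the genome entirely, its error rate mostly reflects class imbalance, which is why it serves as a reference point for the other columns.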
Entire feature space: Average testing set error rate and sparsity (in parentheses) for 10 random partitions of the data
| Dataset | SCM | LinSVM | PolySVM | Baseline |
|---|---|---|---|---|
| | | | | |
| Azithromycin | | 0.050 (32 752 570) | 0.048 (32 752 570) | 0.446 |
| Ceftriaxone | | 0.079 (25 405 987) | 0.076 (25 405 987) | 0.306 |
| Clarithromycin | | 0.053 (32 752 570) | 0.053 (32 752 570) | 0.446 |
| Clindamycin | | 0.039 (30 988 214) | 0.039 (30 988 214) | 0.136 |
| Moxifloxacin | | 0.054 (32 752 570) | 0.048 (32 752 570) | 0.390 |
| | | | | |
| Ethambutol | | 0.215 (9 465 489) | 0.221 (9 465 489) | 0.351 |
| Isoniazid | | 0.117 (9 701 935) | 0.119 (9 701 935) | 0.421 |
| Pyrazinamide | | 0.382 (8 058 479) | 0.382 (8 058 479) | 0.347 |
| Rifampicin | | 0.200 (9 701 935) | 0.204 (9 701 935) | 0.452 |
| Streptomycin | | 0.143 (9 282 080) | 0.148 (9 282 080) | 0.435 |
| | | | | |
| Amikacin | | 0.184 (116 441 834) | 0.179 (116 441 834) | 0.216 |
| Doripenem | | 0.288 (122 438 059) | 0.281 (122 438 059) | 0.359 |
| Levofloxacin | | 0.221 (122 216 859) | 0.225 (122 216 859) | 0.463 |
| Meropenem | | 0.329 (123 466 989) | 0.331 (123 466 989) | 0.404 |
| | | | | |
| Benzylpenicillin | | 0.015 (8 968 176) | 0.015 (8 968 176) | 0.073 |
| Erythromycin | | 0.046 (9 666 898) | 0.047 (9 666 898) | 0.142 |
| Tetracycline | | 0.039 (8 657 259) | 0.037 (8 657 259) | 0.106 |
Results are shown for the SCM and the kernel methods: LinSVM and PolySVM. The baseline method predicts the most abundant class in the training set. The smallest error rates are in bold
Fig. 2 Antibiotic resistance models: Six antibiotic resistance models, which are all disjunctions (logical-OR). The rounded rectangles correspond to antibiotics. The circular nodes correspond to k-mer rules. A single border indicates a presence rule and a double border indicates an absence rule. The numbers in the circles show the number of equivalent rules. A rule is connected to an antibiotic if it was included in its model. The weight of the edges gives the importance of each rule as defined by Eqs. (3) and (4). The models for all 17 datasets are illustrated in Additional file 4: Figure S1
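To make the structure of such models concrete, the sketch below evaluates a disjunction of presence and absence rules on a genome's k-mer set. It is an illustration only: the rule encoding, the function names and the toy k-mers are invented, and the convention that a firing rule means "resistant" is an assumption.

```python
from typing import List, Set, Tuple

# A rule is (k-mer, kind), where kind is "presence" or "absence".
Rule = Tuple[str, str]


def rule_fires(rule: Rule, genome_kmers: Set[str]) -> bool:
    """Return True if the rule holds for the given set of k-mers."""
    kmer, kind = rule
    if kind == "presence":
        return kmer in genome_kmers
    if kind == "absence":
        return kmer not in genome_kmers
    raise ValueError(f"unknown rule kind: {kind!r}")


def predict_disjunction(rules: List[Rule], genome_kmers: Set[str]) -> int:
    """Logical-OR model: predict 1 if at least one rule fires, else 0.

    Treating 1 as "resistant" is an assumption made for this example.
    """
    return int(any(rule_fires(r, genome_kmers) for r in rules))


# Hypothetical model and k-mers, made up for illustration.
toy_rules: List[Rule] = [("AAAC", "presence"), ("GGTT", "absence")]
no_rule_fires = predict_disjunction(toy_rules, {"GGTT", "CCCC"})
absence_fires = predict_disjunction(toy_rules, {"CCCC"})
presence_fires = predict_disjunction(toy_rules, {"AAAC", "GGTT"})
```

A model of this form can be read directly: each rule that fires is a candidate explanation for the predicted phenotype.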
Fig. 3 Going beyond k-mers: This figure shows the location, on the katG gene, of each k-mer targeted by the isoniazid model (rule and equivalent rules). All the k-mers overlap a concise locus, suggesting that it contains a point mutation that is associated with the phenotype. A multiple sequence alignment revealed a high level of polymorphism at codon 315 (shown in red). The wild-type sequence (WT), as well as the resistance-conferring variants S315G, S315I, S315N and S315T, were observed. The rule in the model captures the absence of WT and thus includes the occurrence of all the observed variants
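The idea of locating a set of k-mers on a gene and checking that they overlap a common locus can be sketched as follows. This is a simplified illustration based on exact string matching; the function names are hypothetical and the sequence is a made-up toy, not katG.

```python
from typing import Dict, List


def kmer_positions(gene: str, kmers: List[str]) -> Dict[str, List[int]]:
    """Map each k-mer to the 0-based start positions of its exact matches."""
    positions: Dict[str, List[int]] = {}
    for km in kmers:
        hits = []
        start = gene.find(km)
        while start != -1:
            hits.append(start)
            start = gene.find(km, start + 1)  # allow overlapping matches
        positions[km] = hits
    return positions


def common_locus(gene: str, kmers: List[str]) -> List[int]:
    """Positions of the gene that are covered by every k-mer in the list."""
    covered = None
    for km, hits in kmer_positions(gene, kmers).items():
        cells = {p for s in hits for p in range(s, s + len(km))}
        covered = cells if covered is None else covered & cells
    return sorted(covered or [])


# Toy sequence and k-mers, invented for illustration only.
toy_gene = "TTGACCAGCGGCATCGAG"
toy_kmers = ["CAGCGGC", "AGCGGCA", "GCGGCAT"]
locus = common_locus(toy_gene, toy_kmers)
```

Here the three overlapping 7-mers share positions 7 to 11 of the toy sequence, which is the kind of narrow window that would then be inspected with a multiple sequence alignment.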
Fig. 4 Overcoming spurious correlations: This figure shows how spurious correlations in the M. tuberculosis data affect the models produced by the Set Covering Machine. a For each M. tuberculosis dataset, the proportion of isolates that are identically labeled in each other dataset is shown. This proportion is calculated using Eq. (2). b The antibiotic resistance models learned by the SCM at each iteration of the correlation removal procedure. Each model is represented by a rounded rectangle identified by the round number and the estimated error rate. All the models are disjunctions (logical-OR). The circular nodes correspond to k-mer rules. A single border indicates a presence rule and a double border indicates an absence rule. The numbers in the circles show the number of equivalent rules. A rule is connected to an antibiotic if it was included in its model. The weight of the edges gives the importance of each rule
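Eq. (2) is not reproduced in this record, so the following is only one plausible reading of "the proportion of isolates that are identically labeled in each other dataset". It assumes each dataset is a mapping from isolate identifier to a binary label, and that the denominator is the number of isolates in the first dataset; both are assumptions, as is the function name.

```python
from typing import Dict


def identically_labeled_fraction(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Fraction of isolates in `a` that also appear in `b` with the same label.

    The choice of len(a) as denominator is an assumption, not taken from Eq. (2).
    """
    if not a:
        raise ValueError("dataset `a` must contain at least one isolate")
    same = sum(1 for isolate, label in a.items() if b.get(isolate) == label)
    return same / len(a)


# Hypothetical isolates and labels (1 = resistant, 0 = sensitive).
toy_a = {"i1": 1, "i2": 0, "i3": 1, "i4": 0}
toy_b = {"i1": 1, "i2": 1, "i3": 1}
overlap = identically_labeled_fraction(toy_a, toy_b)
```

A value close to 1 would indicate that resistance to the two antibiotics is strongly correlated across isolates, which is the situation in which a model can pick up markers for the wrong drug.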
Fig. 5 The k-mer representation: An example of the k-mer representation. Given the set of observed k-mers and a genome x, the corresponding vector representation is given by ϕ(x)
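A minimal sketch of this representation is given below: each genome is mapped to a binary vector with one entry per observed k-mer, set to 1 if the k-mer occurs in the genome. The function names and toy sequences are invented for illustration, and real genomes would call for a dedicated k-mer counting tool and a sparse data structure.

```python
from typing import Dict, List, Set, Tuple


def kmers(genome: str, k: int) -> Set[str]:
    """Return the set of all (overlapping) k-mers in a genome string."""
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}


def kmer_vectors(genomes: Dict[str, str],
                 k: int) -> Tuple[List[str], Dict[str, List[int]]]:
    """Build binary presence/absence vectors over the observed k-mers."""
    per_genome = {name: kmers(seq, k) for name, seq in genomes.items()}
    # Sorted union of k-mers seen in at least one genome fixes the column order.
    vocabulary = sorted(set().union(*per_genome.values()))
    vectors = {
        name: [1 if km in present else 0 for km in vocabulary]
        for name, present in per_genome.items()
    }
    return vocabulary, vectors


# Toy sequences, invented for illustration only.
toy_genomes = {"g1": "ACGTAC", "g2": "ACGTTT"}
vocab, vecs = kmer_vectors(toy_genomes, k=3)
```

With k = 3, the two toy genomes share the k-mers ACG and CGT and differ in the rest, so their vectors agree on two coordinates and disagree on the other four.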