George S Long1, Mohammed Hussen1, Jonathan Dench1, Stéphane Aris-Brosou2,3.
Abstract
BACKGROUND: A critical goal in biology is to relate the phenotype to the genotype, that is, to find the genetic determinants of various traits. However, while simple monofactorial determinants are relatively easy to identify, the underpinnings of complex phenotypes are harder to predict. While traditional approaches rely on genome-wide association studies based on Single Nucleotide Polymorphism data, the ability of machine learning algorithms to find these determinants in whole proteome data is still not well known.
Keywords: Drug resistance; Genome-wide association study; Influenza virus; Machine learning; Pseudomonas aeruginosa
Year: 2019 PMID: 31182025 PMCID: PMC6558885 DOI: 10.1186/s12864-019-5820-0
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Sensitivity of the algorithms on the analysis of the influenza data
| Site | AB | RRF | Phenotype | References |
|---|---|---|---|---|
| PB2 9 | ✓ | ✓ | Infectivity | [ |
| PB2 105 | ✓ | ✓ | Pathogenicity/Infectivity | [ |
| PB2 339 | ✓ | ✓ | Infectivity | [ |
| PB2 391 | ✓ | Transmissibility | [ | |
| PB2 627 | ✓ | ✓ | Infectivity | [ |
| PB2 667 | ✓ | Infectivity | [ | |
| PB1 215 | ✓ | ✓ | Pathogenicity | [ |
| PB1 375 | ✓ | Pathogenicity | [ | |
| PB1 757 | ✓ | Infectivity | [ | |
| HA 163 | ✓ | Pathogenicity/Infectivity | [ | |
| HA 212 | ✓ | Pathogenicity/Infectivity | [ | |
| HA 246 | ✓ | ✓ | Transmissibility | [ |
| HA 536 | ✓ | Infectivity | [ | |
| NP 400 | | | Pathogenicity | [ |
| NA 49 | ✓ | Transmissibility | [ | |
| NA 75 | | | Transmissibility | [ |
| M2 31 | ✓ | Pathogenicity/Infectivity | [ | |
| NS1 127 | | | Pathogenicity | [ |
| NS1 195 | | | Transmissibility/Infectivity | [ |
| NS1 212 | ✓ | Pathogenicity/Infectivity | [ |
This table lists the genes and amino acid positions known to be involved in the three phenotypes studied here, and which of these were rediscovered by our algorithms. For AB, chunk sizes of 75, 125, and 175 were used to calculate the importance value of each site under adaptive boosting, and an importance threshold of 1 was used to determine whether a site was a potential genetic determinant. For RRF, chunk sizes of 80, 125, and 175 were used, with a threshold at the 90th percentile and a 60% consensus. Data on experimental validations are from the Influenza Research Database [24]. Genes are ordered by segment size. See Figs. 2 and 3 for the specificity of these algorithms.
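The RRF selection rule described in this note (90th percentile threshold, 60% consensus) can be sketched as follows. This is only one possible reading of the rule, not the authors' implementation: the function names are hypothetical, the nearest-rank percentile is an assumption, and "consensus" is assumed to mean the fraction of runs (for example, one per chunk size) in which a site passes the threshold.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty list (assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def top_sites(importance, q=90):
    """Sites whose importance is at or above the q-th percentile of one run."""
    cutoff = percentile(list(importance.values()), q)
    return {site for site, value in importance.items() if value >= cutoff}


def consensus_sites(runs, q=90, min_fraction=0.6):
    """Sites that pass the percentile threshold in at least min_fraction of the runs.

    `runs` is a list of dicts mapping a site label to its importance value,
    e.g. one dict per chunk size (an assumption about how runs are grouped).
    """
    counts = {}
    for importance in runs:
        for site in top_sites(importance, q):
            counts[site] = counts.get(site, 0) + 1
    return {site for site, n in counts.items() if n / len(runs) >= min_fraction}
```

With three runs, a 60% consensus under this reading keeps any site that passes the threshold in at least two of them.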
Fig. 2 Effect of chunk size on the distribution of importance of sites for the AB algorithm. The genes and sites identified as genetic determinants of influenza phenotypes are shown for (a) infectivity, (b) transmissibility, and (c) pathogenicity. Only results for the smallest (75 amino acids), intermediate (125), and largest (175) chunk sizes are shown. Only the most important sites (importance >1.5) are shown in each panel, with sites backed by experimental evidence highlighted in red. Insets show the whole distribution of importance values (left and right columns), and the Venn diagrams of the most important sites at all three chunk sizes (middle column).
Fig. 3 Effect of chunk size on the distribution of importance of sites for the RRF algorithm. The genes and sites identified as genetic determinants of influenza phenotypes are shown for (a) infectivity, (b) transmissibility, and (c) pathogenicity. Only results for the smallest (80 amino acids), intermediate (125), and largest (175) chunk sizes are shown. Only the most important sites (Gini index in the top 90th percentile of its distribution over all the sites) are shown in each panel, with sites backed by experimental evidence highlighted in red. Insets show the whole distribution of importance values (left and right columns), and the Venn diagrams of the most important sites at all three chunk sizes (middle column).
Fig. 1 Impact of chunk size on the runtime of the machine learning algorithms for the influenza data. Runtimes for infectivity (red), transmissibility (blue), and pathogenicity (orange) are shown for AB (a) and RRF (b). While each data point is based on a single run, run-to-run variability is taken into account by performing linear regressions (solid lines); their P-values are also shown.
List of the Influenza A strains and their associated phenotypes, as used in the training of the machine learning algorithms
| Strain name | Infectivity | Transmissibility | Pathogenicity |
|---|---|---|---|
| A/HongKong/156/97 | No | Yes | Yes |
| A/HongKong/213/2003 | No | Yes | Yes |
| A/Indonesia/5/2005 | No | No | No |
| A/Indonesia/7/2005 | No | No | No |
| A/PuertoRico/8/34 | Yes | No | No |
| A/Swine/Indiana/1726/1988 | Yes | Yes | No |
| A/Turkey/15/2006 | No | No | No |
| A/VietNam/1203/2004 | Yes | No | Yes |
| A/VietNam/3046/2004 | Yes | No | Yes |
| A/VietNam/3062/2004 | Yes | No | Yes |
For pathogenicity, polybasic cleavage was used as a proxy
Fig. 4 Impact of chunking and data size on sensitivity and specificity. Simulations were conducted to assess the impact of chunking, number of sequences, and length of protein alignments on (a) sensitivity and (b) specificity of the RRF algorithm, with no class imbalance, or in the presence of class imbalance (c) and (d), respectively. Similar simulations were conducted under the RF algorithm (e), (f), (g) and (h), respectively, still for protein data, and under RRF for DNA data (i), (j), (k) and (l), respectively.
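For reference, sensitivity and specificity at the level of sites can be computed as in the minimal sketch below. It assumes that a simulation provides the set of truly causal sites, the set of sites flagged by the algorithm, and the full set of sites in the alignment; the function name and this set-based framing are illustrative assumptions, not taken from the paper.

```python
def sensitivity_specificity(predicted, true_sites, all_sites):
    """Return (sensitivity, specificity) for a set of flagged sites.

    sensitivity = TP / (TP + FN): fraction of true determinants recovered.
    specificity = TN / (TN + FP): fraction of non-determinants left unflagged.
    """
    predicted, true_sites, all_sites = set(predicted), set(true_sites), set(all_sites)
    tp = len(predicted & true_sites)
    fn = len(true_sites - predicted)
    fp = len(predicted - true_sites)
    tn = len(all_sites - predicted - true_sites)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity
```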
Fig. 5 Analysis of the P. aeruginosa data across the 26 strains from [25]. The distributions of MIC values (on a log2 scale) are shown for (a) Ciprofloxacin, (b) Ceftazidime, and (c) Gentamicin. These empirical distributions were used to determine MIC thresholds for the AB analyses (Table 2). Note that the scales on the y-axis vary slightly. With RRF, a thorough search of the discretization was performed to select thresholds θ1 / θ2 that would minimize the out-of-bag error for (d) Ciprofloxacin, (e) Ceftazidime, and (f) Gentamicin (color scale to the right of each panel). The θ1 / θ2 combinations (red dotted lines, also reported in the first row) were determined visually. The top 10% most important sites (as per their Gini index) are highlighted (box with broken lines) among sites selected at the end of the first tier of the chunking algorithm for (g) Ciprofloxacin, (h) Ceftazidime, and (i) Gentamicin. These top sites are listed in the top right part of each distribution.
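The search over discretization thresholds described in this legend could be automated along the lines of the sketch below, which is a hedged illustration and not the procedure used for the figure (where the final θ1 / θ2 combinations were chosen visually). The `oob_error` argument is a hypothetical user-supplied callable that trains a classifier on the data discretized at (θ1, θ2) and returns its out-of-bag error; requiring θ1 < θ2 is also an assumption.

```python
from itertools import product


def search_thresholds(candidates, oob_error):
    """Exhaustively search threshold pairs and return (theta1, theta2, error).

    candidates: iterable of candidate thresholds on the log2 MIC scale.
    oob_error:  hypothetical callable (theta1, theta2) -> out-of-bag error.
    Only pairs with theta1 < theta2 are evaluated (assumed constraint).
    """
    best = None
    for theta1, theta2 in product(list(candidates), repeat=2):
        if theta1 >= theta2:
            continue
        error = oob_error(theta1, theta2)
        if best is None or error < best[0]:
            best = (error, theta1, theta2)
    if best is None:
        raise ValueError("need at least two distinct candidate thresholds")
    return best[1], best[2], best[0]
```

Ties are resolved in favor of the first pair encountered, so the order of `candidates` matters when several pairs give the same error.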
Gene lists of the most important candidates for drug resistance in P. aeruginosa
| Drug | Setting 1 | Setting 2 | Setting 3 |
|---|---|---|---|
| Ciprofloxacin |
| hypothetical protein |
|
|
| HIS/PHE ammonia-lyase |
| |
|
| LysR family transcriptional regulator |
| |
| recF, recombination protein F | glutamine synthetase | ||
| D,D-heptose 1,7-bisphos. phosphatase | hypothetical protein | ||
| Ceftazidime |
|
|
|
|
| sensor/response regulator hybrid |
| |
|
|
| ||
| hemagglutinin | |||
| recQ, ATP-depend. DNA helicase | |||
| Gentamicin |
|
|
|
|
|
|
| |
|
|
|
| |
|
| tufA, elongation factor Tu |
| |
|
|
| ||
| nirN, c-type cytochrome |
Shown are the genes identified in all four runs under the three settings defined in Table 2. For each drug, the genes identified in all three settings are highlighted (boldface), as well as those found in two out of the three settings (italics). Gene names that are underlined (setting 1 only) are those identified during the cross-validation experiment, under both chunk sizes.
The different sets of MIC thresholds employed to assess the robustness of the classification results with AB in the case of P. aeruginosa
| Drug | Setting 1 | Setting 2 | Setting 3 |
|---|---|---|---|
| Ciprofloxacin | -1/1.5 | 0/2 | -1/2 |
| Ceftazidime | 4/6 | 6/8 | 4/8 |
| Gentamicin | 4/6 | 6/8 | 4/8 |
Shown are the thresholds θ1 / θ2 used on a log2 MIC scale: for instance, setting 1 for Ciprofloxacin means that the MIC is low when log2 MIC ≤ −1 (θ1 = −1), high when log2 MIC > 1.5 (θ2 = 1.5), and medium in between.
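The three-class discretization described in this note can be written as a short function. This is a minimal sketch: the function name and the `SETTINGS` dictionary are illustrative, while the threshold values are copied from the table above.

```python
# Thresholds (theta1, theta2) on the log2 MIC scale, copied from the table above.
SETTINGS = {
    "Ciprofloxacin": [(-1, 1.5), (0, 2), (-1, 2)],
    "Ceftazidime": [(4, 6), (6, 8), (4, 8)],
    "Gentamicin": [(4, 6), (6, 8), (4, 8)],
}


def classify_mic(log2_mic, theta1, theta2):
    """Discretize a log2 MIC value into 'low', 'medium', or 'high'.

    low:    log2_mic <= theta1
    high:   log2_mic >  theta2
    medium: everything in between
    """
    if theta1 > theta2:
        raise ValueError("theta1 must not exceed theta2")
    if log2_mic <= theta1:
        return "low"
    if log2_mic > theta2:
        return "high"
    return "medium"


# Example: classify one log2 MIC value under setting 1 for Ciprofloxacin.
theta1, theta2 = SETTINGS["Ciprofloxacin"][0]
label = classify_mic(0.5, theta1, theta2)
```

Note that the boundaries are asymmetric: a value equal to θ1 is low, whereas a value equal to θ2 is still medium.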
Characterization of RRF drug resistance candidates in the Pseudomonas data, based on the literature
| Drug | Gene | Evidence | References |
|---|---|---|---|
| Ciprofloxacin | pchE | Siderophore family, extracellular iron-acquisition system, upregulation associated with exposure to natural quinolones | [ |
| | | Iron is required for virulence and is deficient in the human lung environment of these clinical strains | [ |
| | cupB3 | Part of an outer membrane porin family; mutations that reduce membrane permeability are linked to Gram-negative bacterial mechanisms for antibiotic resistance | [ |
| | permease | Reduces accumulation of drug inside the cell by decreased cell wall permeability or by pumping the drug out | [ |
| | ABC transporter | See permease | [ |
| | SH3 | Controls numerous protein-protein interactions, some implicated in virulence of pathogenic bacteria | [ |
| | alkB | DNA repair system (fluoroquinolones prevent proper winding and unwinding of DNA during replication); also affects outer membrane lipids, and thus permeability, so it may affect antibiotic resistance | [ |
| | sbrR | Anti-sigma factor, identified as necessary during the chronic infection of respiratory tracts | [ |
| | mnmC | Part of tRNA modification, and thus protein synthesis; no obvious antibiotic or lung environment connection | |
| Ceftazidime | MFS | Membrane translocases, include many multidrug resistance proteins of Gram-positive bacteria | [ |
| | pscC | Type III secretion outer membrane protein, probable general resistance candidate | [ |
| | algW | Mutations are known to confer susceptibility to the beta-lactam family of antibiotics | [ |
| | glyA2 | Produces anti-oxidant coenzymes, involved in cell response to TiO2-based nanocomposite antimicrobials | [ |
| | lysR | Associated with minimum inhibitory concentration of antibiotics and oxidative stress chemicals | [ |
| Gentamicin | rnt | This ribonuclease would have a logical role in degrading 30S bound by the antibiotic | |
| | algW | Correlated to Ceftazidime resistance? | |
| | quinone OR | Antibiotic resistance in response to antibiotics that inhibit protein synthesis, including binding of the 50S ribosomal subunit | [ |