Viviana Quevedo-Tumailli, Bernabe Ortega-Tenezaca, Humberto González-Díaz.
Abstract
Parasite species of the genus Plasmodium cause Malaria, which remains a major global health problem due to parasite resistance to available Antimalarial drugs and increasing treatment costs. Consequently, computational prediction of new Antimalarial compounds with novel targets in the proteome of Plasmodium sp. is a very important goal for the pharmaceutical industry. We can expect the success of a pre-clinical assay to depend on the conditions of the assay per se, the chemical structure of the drug, the structure of the target protein, and factors governing the expression of this protein in the proteome, such as the gene (Deoxyribonucleic acid, DNA) sequence and/or chromosome structure. However, there are no reports of computational models that consider all these factors simultaneously. Difficulties for this kind of analysis include the dispersion of data across different datasets and the high heterogeneity of the data. In this work, we analyzed three databases, ChEMBL (Chemical database of the European Molecular Biology Laboratory), UniProt (Universal Protein Resource), and NCBI-GDV (National Center for Biotechnology Information-Genome Data Viewer), to achieve this goal. The ChEMBL dataset contains outcomes for 17,758 unique assays of potential Antimalarial compounds, including numeric descriptors (variables) for the structure of the compounds as well as a large amount of information about the assay conditions. The NCBI-GDV and UniProt datasets include the sequences of genes and proteins and their functions. In addition, we created two partitions (cassayj = caj and cdataj = cdj) of categorical variables from the ChEMBL dataset. These partitions contain variables that encode information about the experimental conditions of preclinical assays (caj) or about the nature and quality of the data (cdj).
These categorical variables include information about 22 parameters of biological activity (ca0), 28 target proteins (ca1), 9 assay organisms (ca2), etc. We also created another partition (cprotj = cpj) of categorical variables with biological information about the target proteins, genes, and chromosomes. These variables cover 32 genes (cp0), 10 chromosomes (cp1), gene orientation (cp2), and 31 protein functions (cp3). We used a Perturbation-Theory Machine Learning Information Fusion (IFPTML) algorithm to map all this information (from the three databases) into a predictive model and train it. Shannon's entropy measure Shk (numerical variables) was used to quantify the information about the structure of drugs, protein sequences, gene sequences, and chromosomes on the same information scale. Perturbation Theory Operators (PTOs) in the form of Moving Average (MA) operators were used to quantify perturbations (deviations) in the structural variables with respect to their expected values for different subsets (partitions) of categorical variables. We obtained three IFPTML models using General Discriminant Analysis (GDA), Classification Tree with Univariate Splits (CTUS), and Classification Tree with Linear Combinations (CTLC). The IFPTML-CTLC model showed the best performance, with Sensitivity Sn(%) = 83.6/85.1 and Specificity Sp(%) = 89.8/89.7 for the training/validation sets, respectively. This model could become a useful tool for the optimization of preclinical assays of new Antimalarial compounds vs. different proteins in the proteome of Plasmodium.
Keywords: Antimalarial compounds; ChEMBL; NCBI-GDV; Plasmodium proteome; UniProt; complex networks; machine learning; perturbation theory
Year: 2021 PMID: 34884870 PMCID: PMC8657696 DOI: 10.3390/ijms222313066
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. General workflow of the steps followed in this work.
Figure 2. Variables pre-processing vs. post-processing.
IFPTML-GDA model result.
| Observed set a | Statistical parameter b | Predicted statistics (%) | nj | f(vij)pred = 0 | f(vij)pred = 1 |
|---|---|---|---|---|---|
| Training series | | | | | |
| f(vij)obs = 0 | Sp(%) | 98.8 | 13,087 | 12,934 | 153 |
| f(vij)obs = 1 | Sn(%) | 65.9 | 232 | 79 | 153 |
| Total | Ac(%) | 98.3 | 13,319 | | |
| External validation series | | | | | |
| f(vij)obs = 0 | Sp(%) | 98.7 | 4365 | 4310 | 55 |
| f(vij)obs = 1 | Sn(%) | 66.2 | 74 | 25 | 49 |
| Total | Ac(%) | 98.2 | 4439 | | |
a The observed classification has two classes: f(vij)obs = 1 for drugs with the desired level of biological effect, and f(vij)obs = 0 otherwise. b Sn(%) = Sensitivity, Sp(%) = Specificity, and Ac(%) = Accuracy.
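
The statistics in the IFPTML-GDA table follow directly from the confusion-matrix counts. A minimal Python sketch, assuming the usual definitions of specificity, sensitivity, and accuracy (the function name `classification_stats` is illustrative and not taken from the paper):

```python
def classification_stats(tn, fp, fn, tp):
    """Return (Sp, Sn, Ac) in percent from confusion-matrix counts.

    tn/fp: observed class 0 predicted as 0/1
    fn/tp: observed class 1 predicted as 0/1
    """
    sp = 100.0 * tn / (tn + fp)                    # specificity
    sn = 100.0 * tp / (tp + fn)                    # sensitivity
    ac = 100.0 * (tn + tp) / (tn + fp + fn + tp)   # accuracy
    return sp, sn, ac


# Counts copied from the IFPTML-GDA table above.
train = classification_stats(tn=12934, fp=153, fn=79, tp=153)
valid = classification_stats(tn=4310, fp=55, fn=25, tp=49)

print([round(v, 1) for v in train])  # [98.8, 65.9, 98.3]
print([round(v, 1) for v in valid])  # [98.7, 66.2, 98.2]
```

The high accuracy is driven mostly by the large control class (13,087 of 13,319 training cases), which is why sensitivity is the more informative figure when comparing the models.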
Selected values of multi-condition averages for different combinations of assay conditions.
| c0 = Activity (Units) | Cut-off(c0) = 1 | 10 | 25 | 50 | 75 | 95 | 100 | 200 | Total |
|---|---|---|---|---|---|---|---|---|---|
| Inhibition (%) | 9785 | 1535 | 564 | 376 | 228 | 78 | 39 | - | 13,469 |
| IC50 (nM) | 2 | 29 | 49 | 81 | 101 | 108 | 110 | 133 | 3715 |
| Ki (nM) | 24 | 78 | 100 | 120 | 132 | 134 | 138 | 160 | 369 |
| Other Activities | 59 | 133 | 146 | 148 | 150 | 149 | 150 | 152 | 205 |
| n(f(vij)obs = 1) | 9870 | 1775 | 859 | 725 | 611 | 469 | 437 | 445 | 17,758 |
| n(f(vij)obs = 0) | 7888 | 15,983 | 16,899 | 17,033 | 17,147 | 17,289 | 17,321 | 17,313 | |
Figure 3. IFPTML-CTUS model decision tree.
IFPTML-CTUS model coefficients.
| Node | Left branch | Right branch | Control n(f(vij) = 0) | Active n(f(vij) = 1) | Predicted class | Split constant | Split variable |
|---|---|---|---|---|---|---|---|
| 1 | 2 | 3 | 13,087 | 232 | 0 | 0.11321607 | |
| 2 | 4 | 5 | 12,903 | 72 | 0 | 0.02505894 | |
| 3 | | | 184 | 160 | 1 | -- | |
| 4 | 6 | 7 | 12,542 | 56 | 0 | 0.00895431 | |
| 5 | | | 361 | 16 | 1 | -- | |
| 6 | | | 2623 | 0 | 0 | -- | |
| 7 | 8 | 9 | 9919 | 56 | 0 | −0.0982586 | ΔSh(Drug;Halog)2cdj |
| 8 | 10 | 11 | 5006 | 38 | 0 | 2.55375728 | ΔSh(Drug;Csat)1cpj |
| 9 | 12 | 13 | 4913 | 18 | 0 | 1.318866 | ΔSh(Drug;Hx)4cpj |
| 10 | 14 | 15 | 4821 | 33 | 0 | 0.02739699 | ΔSh(Drug;Csat)1cpj |
| 11 | | | 185 | 5 | 1 | -- | |
| 12 | 16 | 17 | 4809 | 17 | 0 | 1.01671015 | ΔSh(Drug;Hx)4cpj |
| 13 | | | 104 | 1 | 0 | -- | |
| 14 | | | 2681 | 17 | 0 | -- | |
| 15 | 18 | 19 | 2140 | 16 | 0 | 1.87205633 | ΔSh(Drug;Csat)5caj |
| 16 | | | 4726 | 15 | 0 | -- | |
| 17 | | | 83 | 2 | 1 | -- | |
| 18 | | | 1868 | 11 | 0 | -- | |
| 19 | | | 272 | 5 | 1 | -- | |
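
As a consistency check, the terminal nodes of the IFPTML-CTUS tree can be aggregated into a confusion matrix. The sketch below is an illustration only: it treats the rows with no branches (split constant "--") as leaves and sums their control and active counts by predicted class. The variable names are assumptions, not the authors' code.

```python
# (node, n_control, n_active, predicted_class) for each terminal node,
# copied from the IFPTML-CTUS coefficients table above.
leaves = [
    (3, 184, 160, 1), (5, 361, 16, 1), (6, 2623, 0, 0), (11, 185, 5, 1),
    (13, 104, 1, 0), (14, 2681, 17, 0), (16, 4726, 15, 0), (17, 83, 2, 1),
    (18, 1868, 11, 0), (19, 272, 5, 1),
]

# Controls (observed 0) and actives (observed 1) split by the leaf's class.
tn = sum(n0 for _, n0, _, cls in leaves if cls == 0)
fp = sum(n0 for _, n0, _, cls in leaves if cls == 1)
fn = sum(n1 for _, _, n1, cls in leaves if cls == 0)
tp = sum(n1 for _, _, n1, cls in leaves if cls == 1)

sp = 100.0 * tn / (tn + fp)  # specificity in percent
sn = 100.0 * tp / (tp + fn)  # sensitivity in percent

print(tn, fp, fn, tp)              # 12002 1085 44 188
print(round(sp, 1), round(sn, 1))  # 91.7 81.0
```

These totals reproduce the root-node counts (13,087 controls and 232 actives) and match the training-set row with Sp = 91.7 and Sn = 81.0 in the comparison of models.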
Figure 4. IFPTML-CTLC model decision tree.
IFPTML-CTLC model coefficients.
| Var | Coeff. | f(vij)01 | f(vij)02 | f(vij)03 | f(vij)04 | f(vij)05 | f(vij)06 | Mean | S.D. |
|---|---|---|---|---|---|---|---|---|---|
| Split const. | a00 | −0.005 | −0.024 | −0.024 | −0.010 | −0.071 | −0.077 | −0.04 | 0.03 |
| | a01 | 0.044 | 0.762 | 0.751 | 0.818 | 2.678 | 2.881 | 1.32 | 1.17 |
| ΔSh(Drug;Csat)5caj | a02 | 0.000 | 0.008 | −0.001 | −0.003 | −0.008 | −0.007 | 0.00 | 0.01 |
| ΔSh(Drug;Hetero)5caj | a03 | −0.001 | −0.010 | −0.042 | −0.033 | −0.103 | −0.143 | −0.06 | 0.06 |
| ΔSh(Drug;Hx)1caj | a04 | 0.001 | 0.020 | 0.047 | 0.047 | 0.120 | 0.160 | 0.07 | 0.06 |
| ΔSh(Drug;Csat)1cpj | a05 | 0.001 | 0.014 | 0.020 | 0.023 | 0.083 | 0.093 | 0.04 | 0.04 |
| ΔSh(Drug;Hetero)4cpj | a06 | 0.001 | 0.009 | 0.036 | 0.028 | 0.078 | 0.109 | 0.04 | 0.04 |
| ΔSh(Drug;Hx)4cpj | a07 | −0.001 | −0.017 | −0.038 | −0.037 | −0.092 | −0.117 | −0.05 | 0.04 |
| ΔSh(Drug;Csat)1cdj | a08 | −0.001 | −0.019 | −0.016 | −0.017 | −0.065 | −0.079 | −0.03 | 0.03 |
| ΔSh(Drug;Halog)1cdj | a09 | 0.003 | 0.057 | 0.087 | 0.088 | 0.713 | 0.577 | 0.25 | 0.31 |
| ΔSh(Drug;Halog)2cdj | a10 | −0.003 | −0.059 | −0.094 | −0.095 | −0.740 | −0.609 | −0.27 | 0.32 |
| ΔSh(Chr;Gen)5caj | a11 | 0.000 | 0.000 | 0.002 | 0.002 | 0.039 | 0.075 | 0.02 | 0.03 |
| ΔSh(Prot;Seq)5cdj | a12 | 0.000 | 0.004 | −0.002 | −0.003 | 0.008 | 0.024 | 0.01 | 0.01 |
Comparison of models with different algorithms.
| Algorithm | Set | Class | Stat Param. | Value | f(vij)pred = 0 | f(vij)pred = 1 |
|---|---|---|---|---|---|---|
| IFPTML-GDA | Train | f(vij)obs = 0 | Sp | 98.8 | 12,934 | 153 |
| | | f(vij)obs = 1 | Sn | 65.9 | 79 | 153 |
| | Validation | f(vij)obs = 0 | Sp | 98.7 | 4310 | 55 |
| | | f(vij)obs = 1 | Sn | 66.2 | 25 | 49 |
| IFPTML-CTUS | Train | f(vij)obs = 0 | Sp | 91.7 | 12,002 | 1085 |
| | | f(vij)obs = 1 | Sn | 81.0 | 44 | 188 |
| | Validation | f(vij)obs = 0 | Sp | 91.6 | 3997 | 368 |
| | | f(vij)obs = 1 | Sn | 82.4 | 13 | 61 |
| IFPTML-CTLC (π0 = 0.50) | Train | f(vij)obs = 0 | Sp | 89.8 | 11,751 | 1336 |
| | | f(vij)obs = 1 | Sn | 83.6 | 38 | 194 |
| | Validation | f(vij)obs = 0 | Sp | 89.7 | 3917 | 448 |
| | | f(vij)obs = 1 | Sn | 85.1 | 11 | 63 |
Figure 5. An example of the IFPTML-CTLC model.
Figure 6. IFPTML model development and IF process.
More relevant functions used in the data pre-processing stage.
| Variable | Excel Functions Syntax | Notes |
|---|---|---|
| nj(ca0) | =COUNTIF(Range(ca0), Criteria(ca0)) | Function that determines the total number of cases for each Biological activity in the dataset. |
| <vij(ca0)> | =AVERAGEIF (Range(ca0), Criteria(ca0), Range(vij)) | Calculates the average of all the standard values of biological activity in the dataset. It is used as an argument for the cutoff(ca0) function. |
| cutoff(ca0) | =IF(Units(ca0) = %, 95, IF(Units(ca0) = nM, 10, <vij(ca0)>)) | The cutoff value is used to decide whether a compound is active or not. For Activity(%) and Inhibition(%), cutoff(ca0) = 95%. Similarly, for IC50(nM), Ki(nM), and Km(nM), cutoff(ca0) = 10 nM, etc. |
| d(ca0) | =OR(d(ca0) = 1, d(ca0) = −1) | Indicates whether the measured parameter increases (d = 1) or decreases (d = −1) with the desired biological effect. |
| f(vij)obs | =IF(AND(vij > cutoff(ca0), d(ca0) = 1), 1, IF(AND(vij ≤ cutoff(ca0), d(ca0) = −1), 1, 0)) | |
| n(f(vij)obs = 1) | =COUNTIF(Range(ca0), Criteria(ca0), Range( | Function that determines the total number of cases for each biological activity in the dataset with f(vij)obs equal to 1. |
| | =n( | The function of reference |
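
The cutoff and labeling rules in the table above can be mirrored outside Excel. The following Python sketch is an assumption-laden illustration: the function names `cutoff` and `observed_class` are hypothetical, and the sample values are invented for demonstration rather than taken from the dataset.

```python
from statistics import mean


def cutoff(units, values):
    """Cutoff for one activity type: 95 for %, 10 for nM, else the mean value."""
    if units == "%":
        return 95.0
    if units == "nM":
        return 10.0
    return mean(values)  # fallback corresponding to <vij(ca0)>


def observed_class(v, cut, d):
    """Return 1 if the value indicates the desired effect, else 0.

    d = 1: higher values are desired (e.g. Inhibition(%)).
    d = -1: lower values are desired (e.g. IC50(nM)).
    """
    if (d == 1 and v > cut) or (d == -1 and v <= cut):
        return 1
    return 0


# Hypothetical values, for illustration only.
print(observed_class(97.0, cutoff("%", []), d=1))    # 1: inhibition above 95%
print(observed_class(8.0, cutoff("nM", []), d=-1))   # 1: IC50 at or below 10 nM
print(observed_class(50.0, cutoff("nM", []), d=-1))  # 0: IC50 above 10 nM
```

As in the Excel formula, any value of d other than 1 or −1 yields class 0.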
Figure 7. Illustration of different representations of multiple molecular systems.
Partitions and levels (unique values) taken by the categorical (not ordered) input variables.
| Partition | Var. | Information | NLa | Unique Levels |
|---|---|---|---|---|
| cassayj | ca0 | Biological activity | 22 | Inhibition(%); IC50(nM); Ki(nM); IC50(ug.mL−1); BHIA50(-); IC50(mill equivalent); FC(-); Kinact(/min); Activity(%); VAR(-); Ratio(-); Ratio(/M/s); IC50(molar ratio); Ratio IC50(-); Mean(pM mg−1); GST activity (mU mg−1); Km(nM); Ratio(/s/M); Activity(-); Ka(10³/M/s); Kcat(/s); Inhibition(uM) |
| | ca1 | UniProt protein accession ID | 28 | Q8MU52; Q3HTL5; Q9NBA7; Q9NFS5; Q8T6J6; Q25856; P39898; Q9N6S8; Q0PJ46; Q6T755; Q8MMZ4; Q868D6; Q25917; Q9GSW0; Q9NAW4; O77078; Q9NAW2; Q9BJJ9; Q8T6B1; Q9N623; Q9XYC7; P05227; P11144; Q17SB2; O77239; Q9Y006; O96214; O97467 |
| | ca2 | Assay Organism | 9 | |
| cdataj | cd0 | Target mapping | 2 | Protein; Homologous protein |
| | cd1 | APD name | 9 | Peptidase C1; Pkinase; Peptidase S8; Asp; OMPdecase; Spermine synth; Sugar tr; Hist deacetyl |
| | cd2 | APD confidence | 2 | ND (No-Data); high |
| | cd3 | Assay type | 2 | Binding (B) = Data measuring binding of compound to a molecular target; Functional (F) = Data measuring the biological effect of a compound. |
| | cd4 | Data curation level | 3 | Autocuration; Intermediate; Expert |
| | cd5 | Confidence score | 2 | 8 = Homologous single protein target assigned. |
| cprotj | cp0 | Gene | 32 | |
| | cp1 | Chromosome | 10 | II; V; VI; VII; VIII; IX; X; XI; XIII; XIV |
| | cp2 | Orientation | 2 | Downstream = −1; Upstream = 1 |
| | cp3 | Protein function (UniProt) | 31 | Glutathione s-transferase, putative; Falcipain-2 precursor; Cysteine protease, putative; Spermidine synthase; Orotidine-monophosphate-decarboxylase, putative; Glucose-6-phosphate isomerase; Plasmepsin, putative; Falcipain 2 precursor; |
| | cp4 | ChEMBL target function type | 5 | Enzyme; Transporter; Epigenetic regulator; Other cytosolic protein; Unclassified Protein |
a NL = Number of Levels (unique values) remaining after pre-processing.
Input variables of the IFPTML models developed.
| Variable | Symbol | Formula | Categorical | Details |
|---|---|---|---|---|
| - | | n( | ca0 | Expected value of probability p( |
| MMAcaj | ΔSh(Drugi)k,caj | Sh(Drugi)k − ⟨Sh(Drug)k,caj⟩ | caj | Variation (Δ) of the information of the structure of the drug in different subsets of multiple categorical variables related to the pharmacological assay caj. |
| MMAcdj | ΔSh(Drugi)k,cdj | Sh(Drugi)k − ⟨Sh(Drug)k,cdj⟩ | cdj | Variation (Δ) of the information of the structure of the drug in different subsets of multiple categorical variables related to the nature and/or accuracy of the data measured cdj. |
| MMAcpj | ΔSh(Proti)k,cpj | Sh(Proti)k − ⟨Sh(Prot)k,cpj⟩ | cpj | Variation (Δ) of the information of the sequence of the protein, sequence of the gene, and information about the chromosome for different subsets of multiple categorical variables related to the nature of the protein target cpj. |
| MMAcpj | ΔSh(Genei)k,cpj | Sh(Genei)k − ⟨Sh(Gene)k,cpj⟩ | cpj | |
| MMAcpj | ΔSh(Chromi)k,cpj | Sh(Chromi)k − ⟨Sh(Chrom)k,cpj⟩ | cpj | |
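
The moving-average operators in the table above all have the same shape: a case's value minus the mean of that value over the subset of cases sharing the same categorical levels. A minimal Python sketch of that idea, assuming records are plain dictionaries (the function name, the field names, and the Sh values are hypothetical and chosen only for illustration):

```python
from collections import defaultdict


def moving_average_deviations(records, value_key, group_keys):
    """Return value - <value>_group for each record, in input order.

    Records are grouped by the tuple of their categorical levels.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in records:
        g = tuple(r[k] for k in group_keys)
        sums[g] += r[value_key]
        counts[g] += 1

    deviations = []
    for r in records:
        g = tuple(r[k] for k in group_keys)
        deviations.append(r[value_key] - sums[g] / counts[g])
    return deviations


# Hypothetical Sh values; the ca0 and ca1 levels are taken from the partitions table.
records = [
    {"Sh_drug": 1.0, "ca0": "IC50(nM)", "ca1": "P39898"},
    {"Sh_drug": 3.0, "ca0": "IC50(nM)", "ca1": "P39898"},
    {"Sh_drug": 2.0, "ca0": "Ki(nM)", "ca1": "P39898"},
]
delta = moving_average_deviations(records, "Sh_drug", ["ca0", "ca1"])
print(delta)  # [-1.0, 1.0, 0.0]
```

Passing a different list of `group_keys` (for example levels from cdj or cpj) gives the corresponding deviation for that partition without changing the function.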