João D Ferreira, Francisco M Couto.
Abstract
With the increasing amount of data made available in the chemical field, there is a strong need for systems capable of comparing and classifying chemical compounds efficiently and effectively. The best approaches existing today are based on the structure-activity relationship premise, which states that the biological activity of a molecule is strongly related to its structural or physicochemical properties. This work presents a novel approach to the automatic classification of chemical compounds by integrating semantic similarity with existing structural comparison methods. Our approach was assessed using the Matthews Correlation Coefficient of its predictions, and achieved values of 0.810 for blood-brain barrier permeability, 0.694 for P-glycoprotein substrate, and 0.673 for estrogen receptor binding activity. These results show a significant improvement over currently existing methods, whose best performances were 0.628, 0.591, and 0.647, respectively. The integration of semantic similarity was thus shown to be a feasible and effective way to improve existing chemical compound classification systems. Among other possible uses, this tool helps the study of the evolution of metabolic pathways, the study of the correlation of metabolic networks with properties of those networks, or the improvement of ontologies that represent chemical information.
Year: 2010 PMID: 20885779 PMCID: PMC2944781 DOI: 10.1371/journal.pcbi.1000937
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Figure 1. Chemical structure of two semantically related compounds.
The two represented molecular structures, clavulanic acid (A) and 3-carboxyphenyl phenylacetamidomethylphosphonate (B), are different, and yet they both inhibit β-lactamase.
Table 1. Performance of previous works.
| Dataset | Classification system | Accuracy | Reference |
| BBB | Artificial Neural Networks | 75.7% | |
| BBB | Random Forest | 80.9% | |
| BBB | Support Vector Machines | 81.5% | |
| P-gp | Four-point Pharmacophore | 62.7% | |
| P-gp | Support Vector Machines | 79.4% | |
| P-gp | Random Forest | 80.6% | |
| estrogen | Decision Forest | | |
| estrogen | Random Forest | 82.8% | |
This table summarizes the performance of several classification methods used on the BBB, P-gp and estrogen problems.
Table 2. Fraction of compounds in the ChEBI ontology.
| Testing set | ChEBI coverage (active) | ChEBI coverage (inactive) | ChEBI coverage (overall) |
| BBB | 74/180 | 79/144 | 47.2% |
| P-gp | 57/109 | 24/87 | 41.3% |
| estrogen | 42/132 | 59/101 | 43.3% |
Fraction of names found in the ChEBI ontology for each set of molecules. Coverage for active and inactive compounds is detailed.
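The overall column can be re-derived from the active and inactive fractions. The following is a minimal sketch, with the counts copied from the table above; the function name `overall_coverage` is only illustrative.

```python
# Counts copied from the coverage table: (found in ChEBI, total) per class.
COVERAGE = {
    "BBB": {"active": (74, 180), "inactive": (79, 144)},
    "P-gp": {"active": (57, 109), "inactive": (24, 87)},
    "estrogen": {"active": (42, 132), "inactive": (59, 101)},
}


def overall_coverage(counts):
    """Return the percentage of compounds found in ChEBI across both classes."""
    found = sum(f for f, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return 100.0 * found / total


for name, counts in COVERAGE.items():
    print(f"{name}: {overall_coverage(counts):.1f}%")
```

Pooling the counts in this way reproduces the 47.2%, 41.3% and 43.3% reported in the overall column.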
Table 3. Replication of the results of BBB.
| Set | Approach | Validation method | Accuracy | MCC |
| BBB | SVM | LMO25 | 81.3% | 0.630 |
| BBB | SVM | LMO25 | 73.8% | 0.484 |
| BBB | Chym | LMO25 | 89.6% | 0.800 |
| BBB | SVM | 10-fold | 81.2% | 0.625 |
| BBB | SVM | 10-fold | 74.1% | 0.492 |
| BBB | Chym | 10-fold | 90.0% | 0.810 |
For the LMO25 method, the accuracy values are the mean of 30 experiments, as explained in the previous section. The Chym results were obtained with the FP3 fingerprint format, the simGIC semantic method using the entire ontology, and α = 0.29.
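For readers unfamiliar with simGIC, the sketch below follows the usual definition of the measure: the summed information content (IC) of the ancestors two terms share, divided by the summed IC of all their ancestors. simUI is the unweighted counterpart, which simply counts ancestors. The toy ancestor sets, the IC values and the function names are hypothetical, and nothing here is taken from the Chym implementation.

```python
def sim_ui(ancestors_a, ancestors_b):
    """Unweighted overlap: shared ancestors divided by all ancestors."""
    union = ancestors_a | ancestors_b
    if not union:
        return 0.0
    return len(ancestors_a & ancestors_b) / len(union)


def sim_gic(ancestors_a, ancestors_b, ic):
    """IC-weighted overlap: each ancestor counts by its information content."""
    shared = sum(ic[t] for t in ancestors_a & ancestors_b)
    total = sum(ic[t] for t in ancestors_a | ancestors_b)
    return shared / total if total else 0.0


# Hypothetical ancestor sets (each includes the ontology root) and IC values.
a = {"root", "inhibitor", "antibiotic"}
b = {"root", "inhibitor", "ester"}
ic = {"root": 0.0, "inhibitor": 1.0, "antibiotic": 2.0, "ester": 3.0}

print(sim_ui(a, b))
print(sim_gic(a, b, ic))
```

Because the root carries no information, simGIC rewards only the shared `inhibitor` ancestor, whereas simUI gives the root the same weight as any other term.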
Table 4. Results of the classification system derived from the Chym comparison method.
| Set | Chym parameters | Chym MCC | Chym accuracy | Best previous approach | Previous MCC | Previous accuracy |
| BBB | FP3, simGIC, all, 0.29 | 0.810 | 90.0% | SVM | 0.628 | 81.5% |
| P-gp | FP4, simUI, role, 0.72 | 0.694 | 87.3% | Random Forests | 0.591 | 80.6% |
| estrogen | FP4, simGIC, role, 0.45 | 0.673 | 82.6% | Random Forests | 0.647 | 82.8% |
Chym parameters are listed as “fingerprint format, semantic method, branch of the ontology used, α”. The validation process used was 10-fold. The Matthews Correlation Coefficient values reported for the previous attempts were not retrieved directly from the papers where those attempts are described, but were estimated from the numbers of true positives, false positives, true negatives and false negatives given in those papers.
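Estimating the Matthews Correlation Coefficient from the four confusion-matrix counts uses the standard formula, sketched below. The counts in the example are invented for illustration and do not come from any of the cited papers.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Convention: report 0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


def accuracy(tp, fp, tn, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + fp + tn + fn)


# Invented counts, for illustration only.
tp, fp, tn, fn = 90, 10, 80, 20
print(f"accuracy = {accuracy(tp, fp, tn, fn):.3f}")
print(f"MCC      = {mcc(tp, fp, tn, fn):.3f}")
```

Unlike accuracy, the coefficient ranges from -1 to 1 and stays informative when the active and inactive classes are unbalanced.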
Table 5. The effect of the alpha parameter in Chym's performance as measured by the Matthews Correlation Coefficient.
| Alpha | BBB | P-gp | estrogen |
| 0.0 | 0.66837 | 0.47723 | 0.26418 |
| 0.1 | 0.74508 | 0.54799 | 0.33957 |
| 0.2 | 0.78206 | 0.54634 | 0.42900 |
| 0.3 | | 0.63492 | 0.50817 |
| 0.4 | 0.75904 | | 0.60167 |
| 0.5 | 0.73267 | 0.61939 | 0.63670 |
| 0.6 | 0.68652 | 0.60764 | 0.66318 |
| 0.7 | 0.64528 | 0.57530 | |
| 0.8 | 0.57281 | 0.54896 | 0.60161 |
| 0.9 | 0.52186 | 0.49979 | 0.64252 |
| 1.0 | 0.51764 | 0.48429 | 0.61333 |
The Chym parameters used are the ones in Table 4, except that instead of a single α value, we present results for several values of α. Validation was performed with a 10-fold approach. Bold values are the maximum for each column.
Figure 2. The effect of the α parameter on the performance of Chym.
For each dataset, the α parameter was removed from the best metric. The resulting incomplete metric was then completed with each value of α in turn, and each completed metric was used to determine performance. The figure shows the variation of performance (as measured by the Matthews Correlation Coefficient) against the value of α. There is a maximum in the plot for every dataset, corresponding to the best metric: α = 0.29, α = 0.72 and α = 0.45 for the BBB (red open circles), P-gp (green closed circles) and estrogen (blue closed squares) datasets, respectively.
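The sweep over the alpha parameter (α) can be sketched as follows. This section does not give the formula that combines the structural and semantic scores, so the convex combination below, and in particular the choice to let α weight the semantic term, is an assumption. `hybrid_similarity`, `sweep_alpha` and the toy score curve are hypothetical names and data.

```python
def hybrid_similarity(structural, semantic, alpha):
    """Blend two similarity scores in [0, 1] with weight alpha.

    Assumption: alpha weights the semantic score and (1 - alpha) the
    structural one; the source text does not state the direction.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * semantic + (1.0 - alpha) * structural


def sweep_alpha(evaluate, steps=10):
    """Score every alpha on an evenly spaced grid and return the best one.

    `evaluate` is a caller-supplied function mapping alpha to a score,
    for example the MCC of a cross-validated classifier.
    """
    grid = [i / steps for i in range(steps + 1)]
    scores = {alpha: evaluate(alpha) for alpha in grid}
    best = max(scores, key=scores.get)
    return best, scores


# Toy single-peaked curve standing in for a real validation run.
best, scores = sweep_alpha(lambda a: 1.0 - (a - 0.3) ** 2)
print(best)
```

With `steps=10` the grid matches the 0.0 to 1.0 spacing used in the alpha table; a finer grid would be needed to locate optima such as 0.29.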
Table 6. The activity coefficients of the most active compounds in ChEBI when compared to the active compounds in each set.
| Set | Rank | Compound ID | Compound name | Coefficient | Ref. |
| BBB | 1 | 50931 | (Z)-chlorprothixene | 0.289 | |
| BBB | 2 | 51137 | mianserin | 0.280 | |
| BBB | 3 | 251412 | adinazolam | 0.279 | |
| P-gp | 7 | 53290 | (S)-donepezil | 0.373 | |
| P-gp | 15 | 31181 | aklavinone | 0.368 | |
| P-gp | 16 | 48723 | (-)-lobeline | 0.366 | |
| estrogen | 2 | 27917 | luteone | 0.277 | |
| estrogen | 4 | 5262 | galangin | 0.274 | |
| estrogen | 5 | 50399 | 3′,4′,7-trihydroxyisoflavone | 0.274 | |
For each compound, a reference showing that the compound is indeed active is given. The thresholds for each problem, as determined by the algorithm detailed in the Methodology section, are 0.243 (BBB), 0.272 (P-gp) and 0.231 (estrogen).
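As a rough illustration of how these thresholds would be applied, the sketch below flags a compound as predicted active when its coefficient reaches the threshold for the problem. The "greater than or equal" rule and the function name `predict_active` are assumptions; the algorithm that determines the thresholds is described in the Methodology section and is not reproduced here.

```python
# Thresholds quoted in the table note above.
THRESHOLDS = {"BBB": 0.243, "P-gp": 0.272, "estrogen": 0.231}


def predict_active(dataset, coefficient):
    """Return True when the coefficient reaches the dataset threshold.

    Assumption: a compound counts as active at or above the threshold.
    """
    return coefficient >= THRESHOLDS[dataset]


# A few rows copied from the table: (set, compound name, coefficient).
rows = [
    ("BBB", "mianserin", 0.280),
    ("P-gp", "(S)-donepezil", 0.373),
    ("estrogen", "galangin", 0.274),
]
for dataset, name, coefficient in rows:
    print(dataset, name, predict_active(dataset, coefficient))
```

Every coefficient listed in the table lies above the threshold for its set, so under this rule each listed compound is predicted active.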