Yiqun Cao, Tao Jiang, Thomas Girke.
Abstract
MOTIVATION: The prediction of biologically active compounds is of great importance for high-throughput screening (HTS) approaches in drug discovery and chemical genomics. Many computational methods in this area focus on measuring the structural similarities between chemical structures. However, traditional similarity measures are often too rigid or consider only global similarities between structures. The maximum common substructure (MCS) approach provides a more promising and flexible alternative for predicting bioactive compounds.
Year: 2008 PMID: 18586736 PMCID: PMC2718661 DOI: 10.1093/bioinformatics/btn186
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Local similarity between compounds. The two structures share a common substructure (dashed boxes). The size difference will result in insignificant scores in 2D fragment-based similarity measures.
Fig. 2. Induced subgraph, common induced subgraph and MCS. Graph (c) is an induced subgraph of both graph (a) and graph (b). Therefore, it is a common induced subgraph of the two graphs. It is also the MCS between them.
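The definitions in this caption can be made concrete with a minimal sketch. The graphs, vertex names and function names below are hypothetical toy examples, not the graphs drawn in the figure; the code only illustrates what "induced" and "common induced" mean for undirected graphs given as edge lists.

```python
from itertools import combinations

def induced_subgraph(edges, vertices):
    """Return the edges of the subgraph induced by `vertices`.

    An induced subgraph keeps every edge of the original graph whose
    two endpoints both lie in the chosen vertex subset.
    """
    chosen = set(vertices)
    return {frozenset(e) for e in edges if set(e) <= chosen}

def is_common_induced_subgraph(edges_a, edges_b, mapping):
    """Check that `mapping` (vertex of A -> vertex of B) is one-to-one and
    preserves both adjacency and non-adjacency between mapped vertices."""
    ea = {frozenset(e) for e in edges_a}
    eb = {frozenset(e) for e in edges_b}
    if len(set(mapping.values())) != len(mapping):
        return False  # two vertices of A mapped to the same vertex of B
    for u, v in combinations(mapping, 2):
        in_a = frozenset((u, v)) in ea
        in_b = frozenset((mapping[u], mapping[v])) in eb
        if in_a != in_b:
            return False
    return True

# Hypothetical toy graphs (assumption: not taken from the paper).
graph_a = [(1, 2), (2, 3), (3, 1), (3, 4)]      # triangle 1-2-3 with a tail to 4
graph_b = [("p", "q"), ("q", "r"), ("r", "p")]  # a plain triangle

triangle_map = {1: "p", 2: "q", 3: "r"}  # triangle onto triangle
path_map = {1: "p", 2: "q", 4: "r"}      # 1 and 4 are not adjacent, p and r are

print(is_common_induced_subgraph(graph_a, graph_b, triangle_map))  # True
print(is_common_induced_subgraph(graph_a, graph_b, path_map))      # False
```

The second mapping fails because an induced subgraph must also preserve missing edges, which is what distinguishes it from an ordinary subgraph.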
Fig. 3. Two example graphs.
Fig. 4. Search tree of a backtracking algorithm searching for the MCS between graphs a and b in Figure 3.
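A backtracking search of this kind can be sketched as follows. This is a generic exhaustive search with a simple size bound, written under the assumption that the MCS is a maximum common induced subgraph that need not be connected; it is not claimed to be the authors' algorithm, and its search tree may differ from the one drawn in the figure. The function name, the adjacency-dict input format and the toy graphs are all hypothetical.

```python
def max_common_induced_subgraph(adj_a, adj_b, labels_a=None, labels_b=None):
    """Backtracking search for a maximum common induced subgraph.

    adj_a, adj_b: dicts mapping each vertex to the set of its neighbours.
    labels_a, labels_b: optional dicts of vertex labels (e.g. atom types).
    Returns the largest mapping found (vertex of A -> vertex of B).
    """
    verts_a = list(adj_a)
    best = {}

    def compatible(u, v, mapping):
        if labels_a is not None and labels_b is not None:
            if labels_a[u] != labels_b[v]:
                return False
        # Adjacency and non-adjacency must agree with every mapped pair.
        return all((x in adj_a[u]) == (y in adj_b[v])
                   for x, y in mapping.items())

    def extend(i, mapping, used_b):
        nonlocal best
        if len(mapping) > len(best):
            best = dict(mapping)
        if i == len(verts_a):
            return
        # Bound: mapping every remaining vertex still cannot beat the best.
        if len(mapping) + (len(verts_a) - i) <= len(best):
            return
        u = verts_a[i]
        for v in adj_b:
            if v not in used_b and compatible(u, v, mapping):
                mapping[u] = v
                used_b.add(v)
                extend(i + 1, mapping, used_b)
                del mapping[u]
                used_b.remove(v)
        # Branch in which u stays unmatched.
        extend(i + 1, mapping, used_b)

    extend(0, {}, set())
    return best

# Hypothetical toy graphs (assumption: not the graphs of Figure 3).
adj_a = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}  # triangle with a tail
adj_b = {"p": {"q"}, "q": {"p", "r"}, "r": {"q", "s"}, "s": {"r"}}  # path

mapping = max_common_induced_subgraph(adj_a, adj_b)
print(len(mapping))  # 3
```

The running time grows exponentially with graph size, so a sketch like this is only practical for very small graphs.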
Fig. 5. Two example structures. The MCS of the two structures will be five disjoint C–O pairs.
Average AUC values using different prediction models and different training set sizes
| Models | Training set size | | | |
|---|---|---|---|---|
| | 1% | 5% | 10% | 25% |
| NCI antiviral dataset | | | | |
| MCS-based | 57.9 (3.0) | 64.0 (2.4) | 67.0 (1.3) | 70.0 (0.9) |
| AP-based | 58.2 (3.1) | 63.7 (1.8) | 65.8 (1.8) | 68.9 (1.5) |
| Hybrid | 61.3 (3.4) | 66.7 (1.9) | 69.2 (1.3) | 71.6 (1.2) |
| NCI anticancer dataset | | | | |
| MCS-based | 60.3 (2.8) | 65.4 (1.8) | 68.0 (1.7) | 70.9 (1.3) |
| AP-based | 59.3 (3.3) | 65.2 (1.8) | 67.8 (1.7) | 70.9 (1.8) |
| Hybrid | 62.7 (3.2) | 69.2 (1.8) | 71.8 (1.4) | 74.8 (1.2) |
The MCS-based model uses the absolute MCS sizes to represent a chemical structure as a vector. The AP-based model uses the AP-based similarity, and the hybrid model concatenates the vectors from both previous models. SDs are given in parentheses.
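The vector construction described in this note can be sketched as below, assuming that each compound is scored against a fixed list of reference compounds and that the hybrid vector is a plain concatenation. The two scoring functions are loud placeholders: compounds are modelled as sets of made-up feature strings, `toy_mcs_size` is a set intersection size standing in for a real MCS computation, and `toy_ap_similarity` is a Tanimoto ratio standing in for the AP-based similarity. None of these names come from the paper.

```python
def similarity_vector(compound, basis, score):
    """One feature per reference compound: score(compound, reference)."""
    return [score(compound, b) for b in basis]

def hybrid_vector(compound, basis, mcs_score, ap_score):
    """Concatenate the MCS-based and AP-based vectors of one compound."""
    return (similarity_vector(compound, basis, mcs_score)
            + similarity_vector(compound, basis, ap_score))

# Placeholder scores (assumptions for illustration only).
def toy_mcs_size(x, y):
    # Stand-in for an absolute MCS size; NOT a real MCS computation.
    return len(x & y)

def toy_ap_similarity(x, y):
    # Stand-in for the AP-based similarity: Tanimoto ratio on feature sets.
    union = x | y
    return len(x & y) / len(union) if union else 0.0

# Hypothetical compounds represented as sets of feature strings.
basis = [{"a", "b"}, {"c", "d", "e"}]
query = {"a", "b", "c"}

vec = hybrid_vector(query, basis, toy_mcs_size, toy_ap_similarity)
print(len(vec))  # 4: two MCS-based features followed by two AP-based features
```

Because the two halves live on different scales (counts versus ratios in [0, 1]), a real pipeline would likely need to consider feature scaling before classification; the sketch leaves that out.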
Average AUC values using the prediction models based on different MCS coefficients
| Models | Databases | |
|---|---|---|
| | NCI antiviral | NCI anticancer |
| MCS | 69.8 (0.9) | 69.9 (1.3) |
| MCS c1 | 70.0 (1.9) | 71.1 (1.3) |
| MCS c2 | 71.0 (0.9) | 71.0 (1.2) |
| MCS c3 | 70.5 (1.9) | 71.4 (0.9) |
| Hybrid | 71.5 (1.2) | 73.8 (1.2) |
| Hybrid c1 | 71.8 (1.7) | 73.8 (1.2) |
| Hybrid c2 | 72.3 (0.9) | 74.4 (1.1) |
| Hybrid c3 | 72.3 (1.2) | 74.2 (1.3) |
SDs are given in parentheses. The NCI antiviral dataset was tested with a training set of 10 000 compounds and the NCI anticancer dataset was tested with a training set of 5000 compounds. The MCS model uses the absolute MCS sizes. The models MCS c1, MCS c2 and MCS c3 use the MCS coefficients listed in Equations (2), (3) and (4), respectively. The hybrid model uses the absolute MCS sizes and the AP information. The models hybrid c1, hybrid c2 and hybrid c3 use the MCS coefficients listed in Equations (2), (3) and (4), respectively, together with the AP information. More data corresponding to different training set sizes are listed in Supplementary Table A1.
Average AUC values using the prediction models with different numbers of basis compounds
| Models | Number of basis compounds | | | | | | |
|---|---|---|---|---|---|---|---|
| | 20 | 40 | 60 | 80 | 100 | 120 | 140 |
| NCI antiviral dataset | | | | | | | |
| AP | 70.7 | 72.4 | 73.3 | 73.9 | 74.0 | 72.9 | 72.9 |
| MCS c2 | 73.0 | 74.6 | 74.6 | 75.8 | 75.5 | 75.4 | 75.2 |
| Hybrid c2 | 74.4 | 75.2 | 75.4 | 76.2 | 76.1 | 75.4 | 75.6 |
| NCI anticancer dataset | | | | | | | |
| AP | 69.5 | 72.4 | 72.6 | 73.2 | 73.9 | 74.7 | 73.7 |
| MCS c2 | 71.0 | 74.2 | 75.2 | 75.5 | 75.9 | 76.6 | 76.4 |
| Hybrid c2 | 74.4 | 75.9 | 75.9 | 76.1 | 76.5 | 77.2 | 76.9 |
For the NCI antiviral dataset, 25 000 randomly selected compounds were used as the training set. For the NCI anticancer dataset, 5000 randomly selected compounds were used as the training set.
AUC values for different prediction models applied to the NCI antiviral dataset
| Models | hybrid c2 | physicochemical-based | descriptor-based | SUBDUE | SubdueCL | FSG |
|---|---|---|---|---|---|---|
| AUC | 82.3 | 47.3 | 72.1 | 58.5 | 65.2 | 79.4 |
Hybrid c2 is our proposed hybrid model using the coefficient from Equation (3) and 80 randomly selected compounds as basis compounds. A radial kernel function-based classifier was used, with C set to 64 and γ set to 0.0625. The physicochemical-based method is described in Deshpande et al. (2005). The descriptor-based method combines 166 MACCS keys from the MDL and Daylight fingerprints. SUBDUE (Holder et al., 1994) and SubdueCL (Gonzalez et al., 2001) are methods based on heuristic substructure discovery. FSG is the method proposed by Deshpande et al. (2005), which uses topological subgraphs but not geometrical subgraphs.
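To show what the γ setting controls, here is a minimal sketch of a radial kernel, assuming the common form K(x, y) = exp(−γ‖x − y‖²) used by standard kernel classifiers. The note above does not spell out the exact kernel formula or the classifier implementation, so the function name and the example vectors are illustrative assumptions; C = 64 would be passed separately to the classifier as its misclassification penalty and does not appear in the kernel itself.

```python
import math

def rbf_kernel(x, y, gamma=0.0625):
    """Radial kernel exp(-gamma * squared Euclidean distance).

    gamma defaults to the value quoted in the table note; the kernel
    form itself is an assumption (the usual RBF definition).
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Hypothetical feature vectors (e.g. similarity scores to basis compounds).
u = [0.0, 0.0]
v = [4.0, 0.0]

print(rbf_kernel(u, u))  # 1.0: identical vectors have maximal similarity
print(rbf_kernel(u, v))  # exp(-1): squared distance 16 times gamma 0.0625
```

A smaller γ makes the kernel decay more slowly with distance, so compounds with moderately different feature vectors still count as similar.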