Arash Keshavarzi Arshadi, Milad Salem, Arash Firouzbakht, Jiann Shiun Yuan.
Abstract
Deep learning's automatic feature extraction has been a revolutionary addition to computational drug discovery, bringing the ability to learn abstract features and to discover complex molecular patterns directly from molecular data. Because biological and chemical knowledge is necessary for overcoming the challenges of data curation, balancing, training, and evaluation, it is important for databases to contain information regarding the exact target and disease of each bioassay. Existing repositories such as PubChem and ChEMBL offer screening data for millions of molecules against a variety of cells and targets; however, their bioassays contain complex biological descriptions that can hinder their use by the machine learning community. In this work, a comprehensive disease- and target-based dataset is collected from PubChem to facilitate and accelerate molecular machine learning for better drug discovery. MolData is one of the largest efforts to date to democratize molecular machine learning, with roughly 170 million drug screening results from 1.4 million unique molecules assigned to specific diseases and targets. It also provides 30 unique categories of targets and diseases. Correlation analysis of the MolData bioassays unveils valuable information for drug repurposing across multiple diseases, including cancer, metabolic disorders, and infectious diseases. Finally, we provide a benchmark of more than 30 models trained on each category using multitask learning. MolData aims to pave the way for computational drug discovery and to accelerate the advancement of molecular artificial intelligence in a practical manner. The MolData benchmark data is available at https://GitHub.com/Transilico/MolData as well as within the additional files.
Keywords: Artificial intelligence; Benchmark; Big data; Biological assays; Database; Drug discovery; Machine learning; PubChem
Year: 2022 PMID: 35255958 PMCID: PMC8899453 DOI: 10.1186/s13321-022-00590-y
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 The pipeline of MolData benchmark creation
Data source summary. MolData was created using 9 data sources; the number of bioassays within each data source is shown in the AID count column. Each molecule in a given source can have bioactivity results for multiple bioassays and therefore contribute multiple data points. Unique active molecules are defined as molecules that demonstrate bioactivity in at least one bioassay. Total data points are given in millions (m)
| PubChem Source | AID count | Active data points | Total data points | % Active datapoints | Unique active molecules | Total unique molecules | % Unique active molecules |
|---|---|---|---|---|---|---|---|
| Broad Institute | 67 | 125,627 | 22.2 m | 0.56% | 85,579 | 472,858 | 18.1% |
| Burnham Center for Chemical Genomics | 67 | 139,021 | 21.9 m | 0.63% | 77,159 | 381,794 | 20.21% |
| Emory University Molecular Libraries Screening Center | 12 | 24,195 | 2.47 m | 0.98% | 20,964 | 348,231 | 6.02% |
| ICCB-Longwood Screening Facility, Harvard Medical School | 11 | 8358 | 2.1 m | 0.39% | 6656 | 564,021 | 1.18% |
| Johns Hopkins Ion Channel Center | 22 | 48,545 | 6.8 m | 0.71% | 35,487 | 344,497 | 10.30% |
| NMMLSC | 42 | 48,186 | 11.5 m | 0.42% | 37,949 | 369,431 | 10.27% |
| National Center for Advancing Translational Sciences (NCATS) | 174 | 720,319 | 53.4 m | 1.35% | 240,096 | 592,616 | 40.51% |
| The Scripps Research Institute Molecular Screening Center | 148 | 275,224 | 47.6 m | 0.58% | 142,055 | 920,418 | 15.43% |
| Tox21 | 57 | 21,475 | 0.47 m | 5.67% | 4183 | 8743 | 47.84% |
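The percentage columns in the table above follow directly from the count columns. A minimal sketch of that arithmetic, using counts copied from two rows of the table (the helper name `percent` is hypothetical and not part of the MolData code):

```python
def percent(part: int, whole: int) -> float:
    """Return part / whole as a percentage, rounded to two decimals."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100.0 * part / whole, 2)


# Counts copied from the Broad Institute row of the table above.
broad_unique_active = 85_579
broad_unique_total = 472_858

# Counts copied from the Emory University row of the table above.
emory_unique_active = 20_964
emory_unique_total = 348_231

broad_pct = percent(broad_unique_active, broad_unique_total)
emory_pct = percent(emory_unique_active, emory_unique_total)

print(f"Broad Institute: {broad_pct}% unique active molecules")
print(f"Emory University: {emory_pct}% unique active molecules")
```

Because a molecule counts as active if it is active in at least one bioassay, the share of unique active molecules is much higher than the share of active data points for the same source.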
Fig. 2 Map of the bioassays' descriptions using the output of the BioBERT model (the explained variance ratios for the first and second principal components are 14.72% and 5.40%, respectively)
Disease-based information for the MolData Benchmark
| Tag | AID count | Active data points | Total data points | % Active datapoints | Unique active molecules | Total unique molecules | % Unique active molecules |
|---|---|---|---|---|---|---|---|
| All Categories | 600 | 1,410,950 | 168,345,532 | 0.84 | 672,935 | 1,429,989 | 47.06 |
| Cancer | 236 | 575,454 | 68,649,771 | 0.84 | 230,049 | 1,323,311 | 17.38 |
| Nervous System | 174 | 378,812 | 54,753,975 | 0.69 | 170,353 | 651,249 | 26.16 |
| Immune system | 129 | 322,362 | 38,418,661 | 0.84 | 157,333 | 579,658 | 27.14 |
| Cardiovascular | 94 | 212,162 | 28,660,627 | 0.74 | 124,270 | 542,902 | 22.89 |
| Toxicity | 54 | 48,653 | 2,452,656 | 1.98 | 30,936 | 487,219 | 6.35 |
| Obesity | 53 | 90,837 | 14,516,199 | 0.63 | 65,993 | 545,513 | 12.1 |
| Virus | 47 | 113,946 | 14,679,312 | 0.78 | 81,702 | 621,945 | 13.14 |
| Diabetes | 43 | 61,408 | 11,645,151 | 0.53 | 47,830 | 543,600 | 8.8 |
| Metabolic Disorders | 42 | 126,772 | 9,985,491 | 1.27 | 70,665 | 527,382 | 13.4 |
| Bacteria | 40 | 132,593 | 12,314,737 | 1.08 | 89,554 | 1,290,782 | 6.94 |
| Parasite | 24 | 98,950 | 7,302,206 | 1.36 | 75,027 | 500,228 | 15 |
| Epigenetics, Genetics | 23 | 92,837 | 6,815,597 | 1.36 | 65,244 | 439,537 | 14.84 |
| Pulmonary | 19 | 45,940 | 6,122,297 | 0.75 | 36,467 | 524,167 | 6.96 |
| Infection | 11 | 93,444 | 3,312,920 | 2.82 | 63,782 | 521,473 | 12.23 |
| Aging | 10 | 9030 | 3,079,580 | 0.29 | 8527 | 511,471 | 1.67 |
| Fungal | 7 | 9253 | 2,147,751 | 0.43 | 8824 | 444,373 | 1.99 |
Target-based information for the MolData Benchmark
| Target | AID count | Unique target count | Active data points | Total data points | % Active datapoints | Unique active molecules | Total unique molecules | % Unique active molecules |
|---|---|---|---|---|---|---|---|---|
| All Targets | 383 | 296 | 862,370 | 103,440,515 | 0.83 | 261,715 | 675,161 | 38.76 |
| Membrane receptor | 85 | 44 | 146,956 | 25,922,533 | 0.56 | 91,489 | 458,818 | 19.94 |
| Enzyme (other) | 54 | 51 | 83,657 | 16,210,090 | 0.51 | 57,808 | 632,142 | 9.14 |
| Nuclear receptor | 53 | 25 | 74,776 | 6,083,509 | 1.22 | 42,838 | 442,487 | 9.68 |
| Hydrolase | 36 | 32 | 113,185 | 10,830,324 | 1.05 | 66,195 | 526,391 | 12.57 |
| Protease | 29 | 26 | 37,943 | 7,965,313 | 0.47 | 30,619 | 606,793 | 5.05 |
| Transcription factor | 27 | 18 | 53,416 | 4,775,685 | 1.11 | 40,067 | 503,249 | 7.96 |
| Kinase | 24 | 23 | 38,257 | 7,369,690 | 0.52 | 31,327 | 377,519 | 8.29 |
| Epigenetic regulator | 23 | 20 | 76,793 | 6,840,095 | 1.12 | 51,776 | 523,904 | 9.88 |
| Ion channel | 22 | 14 | 37,402 | 6,745,762 | 0.55 | 28,853 | 511,873 | 5.63 |
| Transferase | 18 | 17 | 43,955 | 6,279,651 | 0.7 | 30,432 | 519,646 | 5.85 |
| Oxidoreductase | 10 | 8 | 33,956 | 2,953,760 | 1.15 | 30,054 | 432,578 | 6.94 |
| Transporter | 9 | 8 | 15,390 | 2,538,579 | 0.60 | 15,046 | 369,621 | 4.07 |
| NTPase | 6 | 5 | 114,465 | 1,981,575 | 5.78 | 76,334 | 439,967 | 17.34 |
| Phosphatase | 5 | 5 | 8090 | 1,693,773 | 0.48 | 6913 | 368,329 | 1.87 |
Fig. 3 The composition of the disease benchmark: number of bioassays for each disease category (as well as toxicity) and their original source
Fig. 4 The composition of the target benchmark: number of bioassays for each target category and their original source
Fig. 5 Cumulative histogram of the largest Tanimoto similarity coefficient for 200,000 molecules within the MolData dataset. More than 92% of the molecules have at least one other molecule in the dataset with a Tanimoto coefficient higher than 0.5, and more than 44% have one with a coefficient higher than 0.7
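The statistic summarized in Fig. 5 can be illustrated with a small sketch. It assumes each molecule's fingerprint is represented as a set of on-bit indices. The fingerprint type, the toy fingerprints, and the function names are all assumptions made for illustration, not details from the source.

```python
from typing import List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient: |A & B| / |A | B| for two sets of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def largest_similarity_to_others(fingerprints: List[Set[int]]) -> List[float]:
    """For each molecule, the highest Tanimoto coefficient to any other molecule."""
    best = []
    for i, fp in enumerate(fingerprints):
        others = (tanimoto(fp, other) for j, other in enumerate(fingerprints) if j != i)
        best.append(max(others, default=0.0))
    return best


def fraction_above(values: List[float], threshold: float) -> float:
    """Share of values strictly greater than the threshold."""
    return sum(v > threshold for v in values) / len(values)


# Toy fingerprints (hypothetical): sets of on-bit indices.
toy_fingerprints = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 6, 7}, {8, 9}]

largest = largest_similarity_to_others(toy_fingerprints)
print(largest)
print(fraction_above(largest, 0.5))
```

This all-pairs loop is quadratic in the number of molecules, so it is only meant to show the definition. A sample on the scale of the figure would call for a vectorized or library-based implementation.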
Fig. 6 Correlation matrix between (a) toxicity bioassays and (b) non-toxic bioassays with correlation > 0.5
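One plausible way to compute a correlation between two bioassays is sketched below. It assumes a Pearson correlation over binary activity labels, restricted to molecules screened in both assays. The source does not spell out the exact procedure here, so the approach, the function name, and the toy labels are all assumptions.

```python
import math
from typing import Dict


def pearson_on_shared(assay_a: Dict[str, int], assay_b: Dict[str, int]) -> float:
    """Pearson correlation of activity labels over molecules present in both assays.

    Returns NaN when fewer than two molecules are shared or a label vector is constant.
    """
    shared = sorted(set(assay_a) & set(assay_b))
    n = len(shared)
    if n < 2:
        return float("nan")
    xs = [assay_a[m] for m in shared]
    ys = [assay_b[m] for m in shared]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


# Toy labels (hypothetical): molecule id -> 1 if active, 0 if inactive.
assay_1 = {"m1": 1, "m2": 0, "m3": 1, "m4": 0, "m5": 0}
assay_2 = {"m1": 1, "m2": 0, "m3": 1, "m4": 0, "m6": 1}
assay_3 = {"m1": 0, "m2": 1, "m3": 1, "m4": 0}

print(pearson_on_shared(assay_1, assay_2))
print(pearson_on_shared(assay_1, assay_3))
```

Restricting the calculation to shared molecules matters because different bioassays screen different compound collections. Pairs with little overlap give unreliable estimates.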
Classification results on the validation and test sets of disease categories, averaged on all tasks within each category
| Benchmark | Validation Accuracy (%) | Validation Recall (%) | Validation Precision (%) | Validation ROC AUC | Test Accuracy (%) | Test Recall (%) | Test Precision (%) | Test ROC AUC |
|---|---|---|---|---|---|---|---|---|
| All Tasks | 64.7 | 76 | 3.93 | 0.7803 | 63.96 | 75.69 | 3.98 | 0.774 |
| Cancer | 73.61 | 68.76 | 3.68 | 0.7809 | 72.96 | 68.44 | 3.76 | 0.7765 |
| Nervous System | 73.34 | 65.1 | 2.39 | 0.7573 | 73.01 | 64.92 | 2.49 | 0.7556 |
| Immune System | 79.7 | 61.34 | 3.41 | 0.777 | 79.49 | 61.01 | 3.5 | 0.7739 |
| Cardiovascular | 80.06 | 56.84 | 2.98 | 0.7498 | 80.06 | 56.39 | 3.13 | 0.7457 |
| Toxicity | 86.9 | 33.41 | 24.46 | 0.7445 | 86.51 | 34.27 | 27.54 | 0.7309 |
| Obesity | 86.01 | 54.1 | 5.42 | 0.7925 | 85.37 | 51.5 | 5.51 | 0.7704 |
| Virus | 77.73 | 62.08 | 2.62 | 0.7625 | 77.9 | 61.91 | 2.84 | 0.7643 |
| Diabetes | 86.69 | 51.27 | 5.8 | 0.7845 | 85.88 | 51.33 | 5.99 | 0.7795 |
| Metabolic Disorders | 83.14 | 53.04 | 6.89 | 0.7619 | 82.71 | 54.85 | 7.06 | 0.7619 |
| Bacteria | 83.1 | 60.82 | 4.63 | 0.7916 | 82.26 | 64.49 | 4.69 | 0.8089 |
| Parasite | 91.51 | 46.65 | 11.31 | 0.8292 | 91.37 | 44.63 | 11.17 | 0.8243 |
| Epigenetics-Genetics | 88.46 | 45.27 | 6.36 | 0.7804 | 88.32 | 40.98 | 5.65 | 0.7251 |
| Pulmonary | 76.82 | 56.7 | 2.34 | 0.7293 | 76.06 | 54.79 | 2.5 | 0.7168 |
| Infection | 92.17 | 31.58 | 12.53 | 0.801 | 92.01 | 29.87 | 11.41 | 0.7871 |
| Aging | 94.83 | 23.59 | 1.86 | 0.7205 | 94.28 | 29.36 | 2.38 | 0.7402 |
| Fungal | 92.36 | 35.22 | 3.5 | 0.75 | 92.77 | 33.93 | 3.61 | 0.7335 |
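The combination of moderate recall and very low precision in the table above is typical of heavily imbalanced data, where actives make up only around 1% of data points. The sketch below computes accuracy, recall, and precision from a confusion matrix. The counts are hypothetical, chosen only to illustrate the effect, and are not taken from the paper.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, recall, and precision from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }


# Hypothetical task: 10,000 molecules, of which 100 (1%) are active.
# The classifier finds 75 of the 100 actives but also flags 3,500 inactives.
metrics = classification_metrics(tp=75, fp=3_500, tn=6_400, fn=25)

for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

With so few actives, even a modest false-positive rate among the inactives produces far more false positives than true positives. This is why precision stays in the low single digits while recall remains high.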
Classification results on the validation and test sets of target categories, averaged on all tasks within each category
| Target Benchmark | Validation Accuracy (%) | Validation Recall (%) | Validation Precision (%) | Validation ROC AUC | Test Accuracy (%) | Test Recall (%) | Test Precision (%) | Test ROC AUC |
|---|---|---|---|---|---|---|---|---|
| All Tasks w/Targets | 73.89 | 69.7 | 4.81 | 0.789 | 73.54 | 68.6 | 4.86 | 0.7786 |
| Membrane receptor | 79.67 | 56.29 | 2.45 | 0.7479 | 79.77 | 53.54 | 2.44 | 0.7251 |
| Enzyme (other) | 85.32 | 60.81 | 2.86 | 0.8123 | 85.05 | 60.2 | 2.98 | 0.8019 |
| Nuclear receptor | 84.45 | 47.15 | 16.98 | 0.7516 | 84.95 | 45.96 | 18.23 | 0.7511 |
| Hydrolase | 84.35 | 65.1 | 3.35 | 0.8165 | 84.09 | 59.5 | 3.43 | 0.7879 |
| Protease | 85.88 | 51.43 | 3.9 | 0.7874 | 85.21 | 52.53 | 3.89 | 0.7792 |
| Transcription factor | 86.72 | 49.45 | 12.59 | 0.7947 | 85.35 | 51.9 | 13.37 | 0.7858 |
| Kinase | 77.56 | 51.29 | 2.34 | 0.7142 | 77.53 | 49.77 | 2.33 | 0.6955 |
| Epigenetic regulator | 86.22 | 58.25 | 6.12 | 0.8097 | 85.55 | 55.25 | 6.23 | 0.8077 |
| Ion channel | 96.82 | 22.63 | 6.81 | 0.7496 | 96.68 | 21.13 | 6.16 | 0.7393 |
| Transferase | 93.32 | 45.94 | 8.43 | 0.837 | 92.86 | 42.59 | 7.75 | 0.8234 |
| Oxidoreductase | 93.25 | 25.42 | 19.1 | 0.7977 | 92.77 | 26.4 | 11.27 | 0.7994 |
| Transporter | 94.93 | 19.21 | 3.67 | 0.7008 | 94.86 | 16.88 | 3.97 | 0.663 |
| NTPase | 93.68 | 10.45 | 41.43 | 0.7479 | 93.33 | 9.59 | 26.78 | 0.7639 |
| Phosphatase | 98.53 | 21.86 | 15.27 | 0.7993 | 98.45 | 19.13 | 14.48 | 0.815 |
Activity of N-methyl-D-aspartate subtype of glutamate receptor (NMDAR) is essential for normal central nervous system (CNS) function. However, excessive activation of NMDAR mediates, at least in part, neuronal or synaptic damage in many neurological disorders, including hypoxic-ischemic brain injury and in Down syndrome. The dual role of NMDARs in normal and abnormal CNS function imposes important constraints on possible therapeutic strategies aimed at ameliorating or abating developmental disorders and neurological disease: blockade of excessive NMDAR activity must be achieved without interference with its normal function. We propose an approach for NMDAR modulation via modulation of the NR3A subunit, a representative of a novel family of NMDAR subunits with the goal to modulate the NMDAR activity. NR3 subunits have a unique structure in their M3 domain forming part of the channel region that contributes to decreased magnesium sensitivity and calcium permeability of NMDARs. It potently and specifically binds glycine and D-serine, but not glutamate. In addition, we have shown that glycine binding to the ligand-binding domain (LBD) of NR3A is essential for NR1/NR3 receptor activation, as opposed to internalization caused by ligand binding to NR1 LBD.