Lauv Patel, Tripti Shukla, Xiuzhen Huang, David W Ussery, Shanzhi Wang.
Abstract
Advances in information technology and related processing techniques have created a fertile base for progress in many scientific fields and industries. In drug discovery and development, machine learning techniques have been used to develop novel drug candidates. Methods for designing drug targets and discovering novel drugs now routinely combine machine learning and deep learning algorithms to enhance the efficiency, efficacy, and quality of the developed outputs. The generation and incorporation of big data, through technologies such as high-throughput screening and high-throughput computational analysis of databases used for both lead and target discovery, has increased the reliability of techniques that incorporate machine learning and deep learning. The use of virtual screening and comprehensive online information has also been highlighted in developing lead synthesis pathways. This review discusses the machine learning and deep learning algorithms used in drug discovery and associated techniques, and reviews the applications and methods that produce promising results.
Keywords: deep learning; drug discovery; in silico screening; machine learning
Year: 2020 PMID: 33198233 PMCID: PMC7696134 DOI: 10.3390/molecules25225277
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Figure 1. The general steps in drug discovery. Machine learning (ML) and deep learning (DL) algorithms may participate in each of the four steps listed, e.g., by mining proteomic data in target discovery, discovering small molecules as candidates in lead discovery, developing quantitative structure-activity relationship models to optimize lead structures for improved bioactivity, and analyzing massive assay results.
Databases used for target discovery.
| Database | Specific Information |
|---|---|
| BRENDA | Enzyme and enzyme-ligand information source. |
| KEGG | Database containing genomic information for functional interpretation and practical application. |
| PubChem | Database of comprehensive information on chemicals and their biological activities. |
| TTD | Therapeutic Target Database, with comprehensive information on drug resistance mutations, gene expression, and target combinations. |
| DrugBank | Detailed drug data and drug-target information database. |
| SuperTarget | Drug-related information database with more than 300,000 compound-target protein relations. |
| TDR targets | Database containing chemogenomic information for neglected tropical diseases. |
| STITCH | Chemical-protein interaction networks. |
| SMD | Database of raw microarray datasets. |
| Gene Expression Omnibus | Database of raw microarray datasets. |
| caArray | Database of cancer-related microarray datasets. |
| CGAP database | Database of cancer-related microarray datasets. |
| Oncomine | Database of cancer-related microarray datasets. |
| UniHI | Database of human molecular interaction networks. |
| Pathguide | Database of 702 biological pathway-related resources and molecular interactions. |
| UniProt | Comprehensive protein information center. |
| InterPro | Database of protein domain information. |
Web-tools and software utilized in target discovery.
| Web-Tool/Software | Specific Information |
|---|---|
| GoPubMed | PubMed search engine used as a text-mining tool. |
| Textpresso | Full-text engine used for text mining, classification, and search. |
| BioRAT | Full-text search engine used for text mining. |
| ABNER | Molecular biology text analyzer and entity tagger. |
| PPICurator | Tool for comprehensive mining of protein-protein interactions. |
| GeneWays | Biological pathway extraction tool. |
Databases used for lead discovery, optimization, and synthesis.
| Database | Specific Information |
|---|---|
| ADReCS | Database of toxicology information with 137,619 drug-ADR pairs. |
| ChEMBL | Database of drug-like small molecules with predicted bioactive properties. |
| ChemSpider | Comprehensive database of over 64 million chemical structures. |
| DrugCentral | Database of drug information, including activity, chemical identity, mode of action, etc. |
Figure 2. Schematic view of drug development using random forest (RF) (a) and support vector machine (SVM) (b). (a) RF reaches its final decision on candidate drugs by combining the results of randomly created decision trees (three trees are shown for simplicity). The computational queries look for multiple features in both the target and the drug. When a feature matches, the query proceeds to the next step to match additional features. A series of datasets is fed into the query, and each tree computes its own prediction. The prediction chosen by most trees is used for the next step. Using many decision trees is intended to reduce the errors of any single tree. (b) SVM uses the training points closest to the class boundary, called support vectors, to distinguish between classes based on the trained features. It constructs hyperplanes that separate two classes (or more, if a multiclass model is needed). SVM incorporates multiple training sets depending on the classifiers and assigns each compound a status (active or inactive). In the process, compounds are separated into three sections: non-selective active compounds, selective active compounds, and inactive compounds, which lie in the margin. Non-selective compounds are active but not selective towards the protein of interest, whereas selective compounds are both active and selective towards it.
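The two decision rules described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in any of the reviewed studies: the three trees, their feature names (`logp`, `mol_weight`, `h_bond_donors`), their thresholds, and the SVM weights are all hypothetical and hand-written, whereas a real model would learn them from training data.

```python
from collections import Counter

# Hypothetical decision trees: each maps a feature dict to a label.
# Feature names and thresholds are invented for illustration only.
def tree_a(x):
    return "active" if x["logp"] < 5.0 else "inactive"

def tree_b(x):
    return "active" if x["mol_weight"] < 500.0 else "inactive"

def tree_c(x):
    return "active" if x["h_bond_donors"] <= 5 else "inactive"

def forest_predict(trees, x):
    """Return the label chosen by most trees (majority vote)."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

def svm_decision(w, b, x):
    """Score w.x + b of a linear SVM; its sign tells which side
    of the separating hyperplane the point x lies on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def svm_classify(w, b, x):
    """Label a point by the sign of its decision score."""
    return "active" if svm_decision(w, b, x) >= 0 else "inactive"

# Two of the three trees vote "active" for this made-up compound.
compound = {"logp": 3.2, "mol_weight": 610.0, "h_bond_donors": 2}
label = forest_predict([tree_a, tree_b, tree_c], compound)

# Hypothetical hyperplane parameters (w, b) in a two-feature space.
score = svm_decision((1.0, -1.0), -0.5, (2.0, 1.0))
```

With an odd number of trees and two labels the vote cannot tie. The SVM part covers only prediction with a linear kernel; finding `w` and `b` requires solving the margin-maximization problem on training data, which is omitted here.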
Figure 3. The general scheme of a deep neural network (DNN) (a) and a recurrent neural network (RNN) (b). (a) A DNN consists of an input layer followed by several hidden layers and an output layer. In this case, the input layer takes feature vectors generated by a convolutional network. Information follows a single path from hidden layer 1 (HL1) to HLn, reflecting the feedforward nature of the network. The generated outputs are often processed using supervised learning techniques to identify and collect sensible interactions. (b) An RNN begins with a seed, S, which is fed into the system. The seed is converted into a reference vector, V1, which the HL uses to generate an output vector, V2. V2 is subsequently optimized using the training sets and yields the output, O. Repeating this process produces outputs that can be gathered into a data set. Meanwhile, the HLs carry information from previous steps forward. One example is chemical structure generation using SMILES string characters as seeds; the gathered outputs then form a SMILES string representing the desired structure. The data set shown in the figure is gathered and analyzed to give the resultant molecules.
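The data flow in the two panels of Figure 3 can be sketched as follows. This is a minimal sketch under stated assumptions rather than a model from the reviewed literature: the layer sizes, the `tanh` activation, the seven-character `ALPHABET`, and the random untrained weights are all assumptions made for illustration. Because the weights are untrained, the generated string shows only the mechanics of seed-driven generation and is not expected to be a valid SMILES string.

```python
import math
import random

def random_matrix(rng, rows, cols):
    """Untrained weights drawn uniformly from [-1, 1] (illustrative only)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def dense(weights, inputs):
    """One fully connected layer with tanh activation (no bias term)."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs))) for row in weights]

def feedforward(layers, features):
    """Pass a feature vector through HL1..HLn and the output layer in order."""
    activation = features
    for weights in layers:
        activation = dense(weights, activation)
    return activation

# Tiny illustrative subset of SMILES characters (an assumption).
ALPHABET = ["C", "N", "O", "(", ")", "=", "1"]

def one_hot(ch):
    """Encode a character as a vector with a single 1.0 entry."""
    return [1.0 if ch == a else 0.0 for a in ALPHABET]

def rnn_generate(w_in, w_rec, w_out, seed, length):
    """Extend a one-character seed by `length` characters.
    The hidden state carries information from earlier steps forward."""
    hidden = [0.0] * len(w_rec)
    generated, ch = seed, seed
    for _ in range(length):
        x = one_hot(ch)
        hidden = [
            math.tanh(sum(a * b for a, b in zip(w_in[j], x))
                      + sum(a * b for a, b in zip(w_rec[j], hidden)))
            for j in range(len(w_rec))
        ]
        scores = [sum(a * b for a, b in zip(row, hidden)) for row in w_out]
        ch = ALPHABET[scores.index(max(scores))]  # greedy choice of next character
        generated += ch
    return generated

rng = random.Random(0)
# DNN: 3 input features -> two hidden layers of 4 units -> 1 output.
layers = [random_matrix(rng, 4, 3), random_matrix(rng, 4, 4), random_matrix(rng, 1, 4)]
output = feedforward(layers, [0.2, -0.1, 0.7])

# RNN: hidden size 5 over the 7-character alphabet.
w_in, w_rec, w_out = random_matrix(rng, 5, 7), random_matrix(rng, 5, 5), random_matrix(rng, 7, 5)
generated = rnn_generate(w_in, w_rec, w_out, "C", 8)
```

The greedy choice of the highest-scoring character keeps the sketch deterministic. Generative models typically sample from a probability distribution over characters instead, and training against known structures is what makes the output chemically meaningful.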