Deepak R Bharti, Anmol J Hemrom, Andrew M Lynn.
Abstract
BACKGROUND: Traditional drug discovery approaches are time-consuming, tedious and expensive. Identifying a potential drug-like molecule with high confidence using high throughput screening (HTS) remains a challenging task in drug discovery and cheminformatics. Only a small percentage of the molecules that pass the clinical trial phases receive FDA approval, and the whole process takes 10-12 years and millions of dollars of investment. Inconsistency in HTS is a further challenge for reproducible results. Reproducibility in computational research is highly desirable as a means to evaluate scientific claims and published findings. This paper describes the development and availability of a knowledge-based predictive model building system using the R Statistical Computing Environment, with reproducibility ensured through the Galaxy workflow system.
Keywords: Cheminformatics; Drug discovery; Galaxy workflow system; High throughput screening; Predictive model building; R statistical package; Reproducible results
Year: 2019 PMID: 30717669 PMCID: PMC7394323 DOI: 10.1186/s12859-018-2492-8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 QSAR-based predictive model building - a typical protocol used in GCAC: The initial data is a molecular structure file (SDF/MOL/SMILES) from which molecular descriptors are generated. Once descriptors are generated, data cleaning is performed to remove redundant entries. Preprocessing is performed in two steps. First, missing values and near-zero-variance features are removed, as they are not useful for model building; the input data is then split into training and test datasets, with the training set used for model building and the test set for model evaluation. In the second preprocessing step, further treatment is applied as required by the selected model-building method, including removal of zero-variance features and highly correlated features, and centering and scaling of the training data. In the model-building step, learning and hyper-parameter optimization are facilitated by resampling, internal cross-validation and performance evaluation over the set of parameters chosen. The most accurate model is selected and evaluated on the test dataset. The selected model is saved and reused whenever activity needs to be predicted for a new set of compounds of unknown activity
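The protocol in Fig. 1 can be sketched in code. The paper's implementation uses R's 'caret' package within Galaxy; the following is a minimal Python/scikit-learn analog of the same steps (near-zero-variance filtering, train/test split, centering and scaling, and hyper-parameter optimization via internal cross-validation), using a synthetic descriptor matrix as a stand-in for PaDEL output:

```python
# Hedged sketch: a scikit-learn analog of the caret-based GCAC protocol.
# The descriptor matrix X and activity labels y below are synthetic
# placeholders, not data from the paper.
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))           # stand-in for calculated descriptors
X[:, 0] = 0.0                            # a zero-variance column, to be removed
y = (X[:, 1] + X[:, 2] > 0).astype(int)  # stand-in activity labels

# Preprocessing step 1: drop (near-)zero-variance features
X = VarianceThreshold(threshold=1e-8).fit_transform(X)

# Split into training and test datasets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Preprocessing step 2 + model building: centering/scaling applied inside a
# pipeline so it is fit on training folds only; hyper-parameter optimization
# via internal cross-validation over a small parameter grid
pipe = Pipeline([("scale", StandardScaler()),
                 ("rf", RandomForestClassifier(random_state=0))])
search = GridSearchCV(pipe, {"rf__n_estimators": [50, 100]}, cv=5)
search.fit(X_tr, y_tr)

# Evaluate the selected model on the held-out test set
test_acc = accuracy_score(y_te, search.predict(X_te))
```

Wrapping the scaler and classifier in one pipeline mirrors caret's behavior of applying preprocessing within each resampling iteration, which avoids information leaking from validation folds into the fitted scaler.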
Fig. 2 GCAC example workflow. The figure is a screenshot of the Galaxy workflow canvas showing the arrangement of the individual elements that can be combined into a workflow for model building and prediction of active molecules. Each element is described in more detail in Table 1 of this manuscript
List of Galaxy Tools developed as part of GCAC: The GCAC suite comprises four major tasks. Each task comprises one or more Tool Shed repositories, each with at least one associated tool. The GCAC tools are available in the Galaxy main Tool Shed (https://toolshed.g2.bx.psu.edu/repository?repository_id=351af44ceb587e54)
| Major Tasks | Toolshed Repositories | Tool Name | Description |
|---|---|---|---|
| Descriptor Calculation | padel_descriptor_calculation | PaDELDescriptor | Calculates descriptors for active and inactive datasets. |
| | activity_files_merge | Merge Activity Files | Assigns response values and merges positive and negative datasets. |
| | redundant_entries_remove | Remove Redundancy | Removes redundant entries of molecules. |
| Feature Selection | feature_selection | Feature Selector | Selects the best feature subset. |
| Model Building and Prediction | csv_to_rdata | Prepare input file | Converts CSV files to RData format. |
| | rcaret_classification_model | R-Caret Classification Model-Builder | Builds a classification model using the ‘caret’ R package. |
| | activity_predict | Predict Activity | Predicts the activity of molecules using their descriptor file (prediction set) and a supplied model. |
| Candidate Compound Extraction | candidate_compound_select | Candidate Compound Selector | Selects compound names or IDs of interesting molecules based on a given cutoff range. |
| | compound_id_extract | CompoundID Extractor | Extracts compound IDs to be used in downstream compound extraction from SDF files. |
| | mayatools_extract | ExtractFromSDFiles | Provides an SDF file of the compounds extracted from the prediction set. |
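The prediction-and-selection stage in the table (Predict Activity followed by Candidate Compound Selector) can be sketched as follows. This is an illustrative Python analog, not the Galaxy tools themselves; the compound IDs, the trained model, and the 0.8 probability cutoff are all hypothetical placeholders:

```python
# Hedged sketch of the Predict Activity / Candidate Compound Selector steps:
# apply a previously trained model to descriptors of unscreened compounds
# and keep the IDs whose predicted probability of activity exceeds a cutoff.
# All data, IDs, and the cutoff value below are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_train = rng.normal(size=(100, 10))
y_train = (X_train[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X_train, y_train)  # stands in for the saved model

ids = [f"CMPD{i:03d}" for i in range(20)]           # hypothetical compound IDs
X_new = rng.normal(size=(20, 10))                   # prediction-set descriptors

proba = model.predict_proba(X_new)[:, 1]            # P(active) per compound
cutoff = 0.8                                        # illustrative activity cutoff
candidates = [cid for cid, p in zip(ids, proba) if p >= cutoff]
```

The selected IDs would then drive the downstream extraction step, analogous to CompoundID Extractor and ExtractFromSDFiles pulling the matching structures out of the original SDF file.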
Fig. 3 Performance metrics for the Fontaine dataset. Performance metrics for the ‘Fontaine dataset’, a standard dataset described in more detail in the Data section of this manuscript, used to validate the protocol with some example models. RF - Random Forest; bagFDA - bagging with Flexible Discriminant Analysis; KNN - K-Nearest Neighbours; SVMRadial - Support Vector Machine using a radial kernel function; NaiveBayes - Naive Bayes; GLM - Generalised Linear Model
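For reference, classification performance metrics of the kind compared across models in Fig. 3 are typically derived from the test-set confusion matrix. A minimal sketch with toy labels (the actual metrics plotted for the Fontaine dataset are those reported in the paper):

```python
# Hedged sketch: computing common classification performance metrics from a
# confusion matrix. The label vectors are toy examples, not Fontaine data.
from sklearn.metrics import confusion_matrix, cohen_kappa_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # true-positive rate (recall on actives)
specificity = tn / (tn + fp)   # true-negative rate (recall on inactives)
kappa       = cohen_kappa_score(y_true, y_pred)  # agreement beyond chance
```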