Jie Dong1, Zhi-Jiang Yao1,2, Min-Feng Zhu1,2, Ning-Ning Wang1, Ben Lu2, Alex F Chen1,2, Ai-Ping Lu3, Hongyu Miao4, Wen-Bin Zeng1, Dong-Sheng Cao5,6.
Abstract
BACKGROUND: In recent years, predictive models based on machine learning techniques have proven feasible and effective in drug discovery. However, to develop such a model, researchers usually have to combine multiple tools and go through several separate steps (e.g., the RDKit or ChemoPy package for molecular descriptor calculation, ChemAxon Standardizer for structure preprocessing, the scikit-learn package for model building, and the ggplot2 package for statistical analysis and visualization). These tasks may also require strong programming skills, which poses a severe challenge for users without advanced training in computer programming. Therefore, an online pipelining platform that integrates a number of selected tools is a valuable and efficient solution that can meet the needs of such researchers.
Keywords: Cheminformatics; Machine learning; Molecular descriptors; Online modeling; QSAR/SAR
Year: 2017 PMID: 29086046 PMCID: PMC5418185 DOI: 10.1186/s13321-017-0215-1
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 The ChemSAR pipeline. It contains six main modules: (1) User space, (2) Structure preprocessing, (3) Data preprocessing, (4) Modelling process, (5) Model interpretation, and (6) Tools. Each module can be used either as one part of the whole pipeline or as an independent tool
Fig. 2 The process of calculating molecular fingerprints: an example illustrating the development methods of ChemSAR
The valid file formats and requirements for each module
| Module name | Input | Output | Description |
|---|---|---|---|
| Feature calculation | | | The standard version of molfile in SDF must be V2000; both 2D and 3D information are valid; the first row of |
| Model selection | | Data table in the page | The first column of the X_train file can hold molecular identifiers such as molecular names or IDs; the first row of the X_train file must contain descriptor names; the first column of the y_train file must match that of the X_train file; the second column must contain the experimental values of the samples (class labels in any other style must be converted to 0 or 1) |
| Model building | | Data table in the page; | Same as “Model selection” |
| Prediction | | Data table in the page; | The requirements for the X_test file are the same as for the X_train file |
| Validation of molecules | | | The standard version of molfile in SDF must be V2000 |
| Standardization of molecules | | | Same as “Validation of molecules” |
| Custom preprocessing | | | Same as “Validation of molecules” |
| Imputation of missing values | | | The first row of the input file must be a header such as descriptor names; every column, including the first, must contain feature values such as descriptor values |
| Removing low variance features | | | Same as “Imputation of missing values” |
| Removing high correlation features | | | Same as “Imputation of missing values” |
| Univariate feature selection | | | The first row of the input file must be a header such as descriptor names; every column from the first to the penultimate must contain feature values such as descriptor values; the last column must contain the experimental values of the samples (class labels in any other style must be converted to 0 or 1) |
| Tree-based feature selection | | Data table in the page; | Same as “Univariate feature selection” |
| RFE feature selection | | Data table in the page; | Same as “Univariate feature selection” |
| Statistical analysis | | Data table in the page; | The four columns of the input file must be, in order: molecular identifier, predicted label, predicted probability, experimental value; the label name can be defined by users |
| Random training set split | | | Same as “Model selection” |
| Diverse training set split | | | The de facto standard version of molfile in SDF must be V2000; both 2D and 3D information are valid |
| Feature importance | | | Same as “Univariate feature selection” |
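The file requirements listed for “Model selection” can be checked before upload with a few lines of standard-library Python. The sketch below is only an illustration of those rules, not ChemSAR's own validation code: the function name `check_training_files` is hypothetical, and it assumes comma-separated files in which both X_train and y_train carry a header row and an identifier column.

```python
import csv
import io


def check_training_files(x_text, y_text):
    """Return a list of problems found in X_train / y_train CSV text.

    Hypothetical helper: the checks mirror the "Model selection" row of the
    table above and assume comma-separated files with a header row.
    """
    # Skip blank lines so that a trailing newline does not create empty rows.
    x_rows = [r for r in csv.reader(io.StringIO(x_text)) if r]
    y_rows = [r for r in csv.reader(io.StringIO(y_text)) if r]
    if len(x_rows) < 2 or len(y_rows) < 2:
        return ["both files need a header row and at least one sample"]

    problems = []

    # First row of X_train: descriptor names (after the identifier column).
    header = x_rows[0][1:]
    if not header or any(not name.strip() for name in header):
        problems.append("X_train header must name every descriptor")

    # First column of y_train must repeat the identifiers of X_train.
    x_ids = [row[0] for row in x_rows[1:]]
    y_ids = [row[0] for row in y_rows[1:]]
    if x_ids != y_ids:
        problems.append("y_train identifiers differ from X_train")

    # Second column of y_train: class labels already converted to 0 or 1.
    for row in y_rows[1:]:
        if len(row) < 2 or row[1].strip() not in {"0", "1"}:
            problems.append(f"label for {row[0]!r} is not 0 or 1")
    return problems


x_train = "ID,MW,LogP\nmol1,180.2,1.2\nmol2,151.2,0.5\n"
y_train = "ID,label\nmol1,1\nmol2,active\n"
print(check_training_files(x_train, y_train))
```

Here the second label is still the text `active`, so the helper reports that it has not been converted to 0 or 1.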
The list of molecular descriptors computed by ChemSAR
| Feature group | Features | Number of descriptors |
|---|---|---|
| Constitution | Molecular constitutional descriptors | 30 |
| Topology | Topological descriptors | 35 |
| Connectivity | Molecular connectivity indices | 44 |
| E-state | E-state descriptors | 245 |
| Kappa | Kappa shape descriptors | 7 |
| Basak | Basak descriptors | 21 |
| Burden | Burden descriptors | 64 |
| Autocorrelation | Moreau-Broto autocorrelation | 32 |
| | Moran autocorrelation | 32 |
| | Geary autocorrelation | 32 |
| Charge | Charge descriptors | 25 |
| Property | Molecular property | 6 |
| MOE-type | MOE-type descriptors | 60 |
| CATS | CATS descriptors | 150 |
| Fingerprints | Topological-Torsion fingerprints | 1024 |
| | MACCS keys | 167 |
| | FP4 fingerprints | 307 |
| | FP2 fingerprints | 1024 |
| | FP3 fingerprints | 210 |
| | E-state fingerprints | 79 |
| | Daylight-type fingerprints | 1024 |
| | ECFP2 fingerprints | 1024 |
| | ECFP4 fingerprints | 1024 |
| | ECFP6 fingerprints | 1024 |
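The per-group counts in the table can be tallied to see how large a full feature matrix would be. The short sketch below copies the numbers from the table; the dictionary names are illustrative only and do not correspond to any ChemSAR identifier.

```python
# Counts copied from the descriptor table above; the dictionary keys are
# illustrative labels, not ChemSAR identifiers.
descriptor_counts = {
    "Constitution": 30, "Topology": 35, "Connectivity": 44, "E-state": 245,
    "Kappa": 7, "Basak": 21, "Burden": 64,
    "Moreau-Broto autocorrelation": 32, "Moran autocorrelation": 32,
    "Geary autocorrelation": 32,
    "Charge": 25, "Property": 6, "MOE-type": 60, "CATS": 150,
}
fingerprint_bits = {
    "Topological-Torsion": 1024, "MACCS keys": 167, "FP4": 307, "FP2": 1024,
    "FP3": 210, "E-state": 79, "Daylight-type": 1024,
    "ECFP2": 1024, "ECFP4": 1024, "ECFP6": 1024,
}

total_descriptors = sum(descriptor_counts.values())
total_fingerprint_bits = sum(fingerprint_bits.values())

print(total_descriptors)        # 783 descriptors across 14 rows
print(len(fingerprint_bits))    # 10 fingerprint types
print(total_fingerprint_bits)   # 6907 bits if every fingerprint is computed
```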
The supported algorithms and related parameters
| Algorithms | Parameters | Recommended parameters |
|---|---|---|
| RandomForest | | |
| SVM | | |
| Naïve Bayes | | |
| K Neighbors | | |
| DecisionTree | | Automatic decision; |
Fig. 3 The workflow of building Caco-2 cell permeability models
The current tools that can be used for SAR modelling
| Tool name | Type | Structure preprocessing | Data preprocessing | Molecular representation | Feature selection | Model selection | Algorithm type | Fees/register | Coupling |
|---|---|---|---|---|---|---|---|---|---|
| ChemSAR | Online | ✓ | ✓ | ✓ | ✓ | ✓ | I | Free | Low |
| OCHEM | Database | ✓ | ✓ | ✓ | ✓ | | I, II | Restricted | High |
| Chembench | Online | ✓ | ✓ | ✓ | ✓ | ✓ | I, II | Need register | High |
| Vcclab | Online | ✓ | ✓ | | | | I | Free | Low |
| OpenTox | Online | | | | | | For toxicity prediction | Free | High |
| QSAR4U | Online | ✓ | | | | | Built-in models | Free | High |
| camb | R package | ✓ | ✓ | ✓ | ✓ | ✓ | I, II | Free | Low |
| AZOrange | Software | ✓ | ✓ | ✓ | | | Disabled for invalid dependences | Free | Low |
| RRegrs | R package | ✓ | ✓ | | | | II | Free | Low |
| eTOXlab | Software | ✓ | ✓ | ✓ | ✓ | | Models for production environments | Free | High |
| QSARINS | Software | ✓ | ✓ | | | | II | Restricted | High |
| OECD QSAR Toolbox | Software | ✓ | | | | | Models for data gap filling | Restricted | High |
| MOLGEN QSPR | Software | ✓ | | | | | II | Free | High |
| BuildQSAR | Software | ✓ | | | | | II | Free | High |
| McQSAR | Software | ✓ | | | | | II | Free | High |
| StarDrop | Software | ✓ | ✓ | | | | I, II | Commercial | – |
| MOE | Software | ✓ | ✓ | | | | I, II | Commercial | Low |
| DS | Software | ✓ | ✓ | | | | I, II | Commercial | Low |
I and II denote classification algorithms and regression algorithms, respectively. “Restricted” means that some modules of the tool are not open to the public or require the developers' permission. “Low” coupling means that the main modules of the tool can be called within the modelling pipeline and can also be used as independent tools, while “high” coupling means that the modules must work together to build a model.
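The idea of low coupling described in the note can be illustrated with a small sketch in which each step works on its own and can also be chained into a pipeline. This is not ChemSAR's implementation: the functions `impute_missing`, `remove_low_variance` and `run_pipeline` are hypothetical, mean imputation and a population-variance threshold are assumed choices, and features are held as a plain dictionary of columns.

```python
from statistics import mean, pvariance


def impute_missing(columns):
    """Replace None in each feature column with that column's mean."""
    filled = {}
    for name, values in columns.items():
        known = [v for v in values if v is not None]
        fill = mean(known) if known else 0.0  # assumed fallback for empty columns
        filled[name] = [fill if v is None else v for v in values]
    return filled


def remove_low_variance(columns, threshold=0.0):
    """Keep only feature columns whose population variance exceeds threshold."""
    return {n: v for n, v in columns.items() if pvariance(v) > threshold}


def run_pipeline(columns, steps):
    """Apply each step in order; every step takes and returns a column dict."""
    for step in steps:
        columns = step(columns)
    return columns


features = {"a": [2.0, None, 4.0], "b": [1, 1, 1]}

# Each step is usable independently ...
print(impute_missing(features))
# ... or as one stage of a pipeline.
print(run_pipeline(features, [impute_missing, remove_low_variance]))
```

Because every step shares the same input and output shape, steps can be reordered, dropped, or run alone, which is the property the table labels as low coupling.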