Akram Mohammed, Greyson Biegert, Jiri Adamec, Tomáš Helikar.
Abstract
Accurate identification of cancer biomarkers and classification of cancer type and subtype from High Throughput Sequencing (HTS) data is a challenging problem because it requires manual processing of raw HTS data from various sequencing platforms, quality control, and normalization, all of which are tedious and time-consuming. Machine learning techniques for cancer class prediction and biomarker discovery can hasten cancer detection and significantly improve prognosis. To date, considerable research effort has been devoted to cancer biomarker identification and cancer class prediction. However, currently available tools and pipelines lack flexibility in data preprocessing and in running multiple feature selection methods and learning algorithms; researchers therefore need a freely available, easy-to-use program. Here, we propose CancerDiscover, an integrative open-source software pipeline that allows users to automatically and efficiently process large raw high-throughput datasets, normalize them, and select the best-performing features from multiple feature selection algorithms. Additionally, the pipeline lets users apply different feature thresholds to identify cancer biomarkers and build various training models to distinguish different types and subtypes of cancer. The open-source software is available at https://github.com/HelikarLab/CancerDiscover and is free for use under the GPL3 license.
Keywords: cancer biomarker; cancer classification; gene expression; machine learning; open-source
Year: 2017 PMID: 29416792 PMCID: PMC5788660 DOI: 10.18632/oncotarget.23511
Source DB: PubMed Journal: Oncotarget ISSN: 1949-2553
Figure 1. Schematic representation of the CancerDiscover pipeline
First, raw data are normalized, background correction is performed, and the output is partitioned into training and testing sets. The testing set is held in reserve for model testing while the training set undergoes feature selection. Feature selection provides a ranked list of attributes that is then used to rebuild the training and testing sets. The rebuilt training set is used to build machine learning models. Finally, the testing set is used to evaluate those models.
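The partition, rank, rebuild, train, and test order described in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CancerDiscover implementation: the data are synthetic, normalization and background correction are omitted, the feature ranking (absolute difference of class means) and the nearest-centroid classifier are simple stand-ins for the pipeline's feature selection methods and its RF/SVM models, and all function names are hypothetical.

```python
import random
from statistics import mean

def split_train_test(X, y, test_fraction=0.25, seed=0):
    """Hold out a test set BEFORE feature selection, as in the caption."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(len(idx) * test_fraction))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    take = lambda ids: ([X[i] for i in ids], [y[i] for i in ids])
    return take(train_idx), take(test_idx)

def rank_features(X, y):
    """Stand-in filter: rank features by |mean(class a) - mean(class b)|."""
    a, b = sorted(set(y))  # assumes exactly two classes
    scores = []
    for j in range(len(X[0])):
        mean_a = mean(row[j] for row, lab in zip(X, y) if lab == a)
        mean_b = mean(row[j] for row, lab in zip(X, y) if lab == b)
        scores.append((abs(mean_a - mean_b), j))
    return [j for _, j in sorted(scores, reverse=True)]

def keep_features(X, cols):
    """Rebuild a dataset so it contains only the selected feature columns."""
    return [[row[j] for j in cols] for row in X]

def fit_centroids(X, y):
    """Stand-in model: one mean vector (centroid) per class."""
    centroids = {}
    for lab in set(y):
        rows = [row for row, l in zip(X, y) if l == lab]
        centroids[lab] = [mean(col) for col in zip(*rows)]
    return centroids

def predict(centroids, row):
    """Assign the class whose centroid is closest in squared distance."""
    dist = lambda c: sum((u - v) ** 2 for u, v in zip(row, c))
    return min(centroids, key=lambda lab: dist(centroids[lab]))

# Synthetic data: 40 samples, 5 features, only feature 2 is informative.
rng = random.Random(1)
X, y = [], []
for i in range(40):
    label = "tumor" if i % 2 == 0 else "normal"
    row = [rng.gauss(0, 1) for _ in range(5)]
    if label == "tumor":
        row[2] += 8.0
    X.append(row)
    y.append(label)

(train_X, train_y), (test_X, test_y) = split_train_test(X, y)
ranked = rank_features(train_X, train_y)      # uses training data only
top = ranked[:1]                              # feature threshold: top 1
model = fit_centroids(keep_features(train_X, top), train_y)
preds = [predict(model, r) for r in keep_features(test_X, top)]
accuracy = sum(p == t for p, t in zip(preds, test_y)) / len(test_y)
print("top feature:", ranked[0], "test accuracy:", accuracy)
```

The point of the ordering is that the ranking is computed from the training set alone, so the held-out samples never influence which features are kept.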
Figure 2. Model accuracies for the classification of tumor vs. normal and adenocarcinoma vs. squamous cell carcinoma. RF denotes the Random Forest classifier and SVM the Support Vector Machine classifier.
(A) Training accuracy for Tumor vs. Normal model, (B) Training accuracy for Adenocarcinoma vs. Squamous Cell Carcinoma model, (C) Testing accuracy for Tumor vs. Normal model, (D) Testing accuracy for Adenocarcinoma vs. Squamous Cell Carcinoma model.
Accuracies of random forest models using top 3 features
| Training model | Precision | Recall | F1-Score |
|---|---|---|---|
| | 98.3 | 98.3 | 98.3 |
| | 97.9 | 98.9 | 98.4 |
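As a quick check of how the three columns relate, the F1-score is the harmonic mean of precision and recall. The short sketch below applies that standard definition to the precision and recall values in the table; the function name is hypothetical, and the values are assumed to be percentages.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs taken from the two table rows
rows = [(98.3, 98.3), (97.9, 98.9)]
for p, r in rows:
    print(f"precision={p} recall={r} F1={f1_score(p, r):.1f}")
```

Rounded to one decimal place, the two rows give 98.3 and 98.4, matching the F1-Score column.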
Benchmarking results
| Samples | Feature selection methods | Models generated | Normalization (Elapsed Time) | Feature selection (Elapsed Time) | Model train & test (Elapsed Time) | Total |
|---|---|---|---|---|---|---|
| 500 | 20 | 665 | 2:05:32 | 21:45:59 | 8:05:32 | 31:57:03 |
| 200 | 20 | 650 | 0:52:31 | 14:16:55 | 4:49:33 | 19:58:59 |
| 100 | 20 | 665 | 0:26:56 | 13:31:22 | 3:12:30 | 17:00:48 |
| 50 | 20 | 665 | 0:16:48 | 12:06:42 | 2:58:56 | 15:12:26 |
| 10 | 19 | 585 | 0:07:03 | 10:05:17 | 2:14:05 | 12:26:25 |
All the datasets contain 54,675 features, and 2 CPUs were used for the analysis.
Elapsed time refers to the real (wall-clock) time spent on each step.
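For readers who want to work with these timings, the per-step elapsed times can be parsed and summed as below. This sketch assumes the values are formatted as h:mm:ss (the table does not state the format explicitly); the helper names are hypothetical, and the example uses the 500-sample row.

```python
from datetime import timedelta

def parse_hms(text):
    """Parse an 'h:mm:ss' string (hours may exceed 24) into a timedelta."""
    h, m, s = (int(part) for part in text.split(":"))
    return timedelta(hours=h, minutes=m, seconds=s)

def format_hms(delta):
    """Format a timedelta back to 'h:mm:ss' without rolling hours into days."""
    total = int(delta.total_seconds())
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h}:{m:02d}:{s:02d}"

# Normalization, feature selection, model train & test (500-sample row)
stages = ["2:05:32", "21:45:59", "8:05:32"]
total = sum((parse_hms(t) for t in stages), timedelta())
print(format_hms(total))  # 31:57:03
```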
Comparisons of machine learning classification components
| Tool Components | CancerDiscover | GenePattern | Chipster | Aliferis |
|---|---|---|---|---|
| Normalization | ✓ | - | ✓ | - |
| Background correction | ✓ | - | - | - |
| Partitioning | ✓ | - | - | - |
| Feature selection | ✓ | - | - | ✓ |
| Modeling | ✓ | ✓ | ✓ | ✓ |
This table highlights the capabilities of tools for performing different functions necessary to generate ML models.