Anna L Swan, Dov J Stekel, Charlie Hodgman, David Allaway, Mohammed H Alqahtani, Ali Mobasheri, Jaume Bacardit.
Abstract
BACKGROUND: Investigations into novel biomarkers using omics techniques generate large amounts of data. Because of their size and number of attributes, these data are well suited to analysis with machine learning methods. A key component of typical machine learning pipelines for omics data is feature selection, which reduces the raw high-dimensional data to a tractable number of features. Feature selection must balance using as few features as possible against maintaining high predictive power. This balance is crucial when the goal of the analysis is to identify small but highly accurate panels of biomarkers with potential clinical utility. In this paper we propose RGIFE (Rule-guided Iterative Feature Elimination), a heuristic for selecting very small feature subsets through an iterative feature elimination process guided by rule-based machine learning. We use this heuristic to identify putative biomarkers of osteoarthritis (OA), articular cartilage degradation and synovial inflammation, using both proteomic and transcriptomic datasets.

RESULTS AND DISCUSSION: Compared with using no feature selection, the RGIFE heuristic increased the classification accuracies achieved for all datasets, and it performed well in a comparison with other feature selection methods. Using this method, the datasets were reduced to a smaller number of genes or proteins, including those known to be relevant to OA, cartilage degradation and joint inflammation. The results show that the RGIFE feature reduction method is suitable for analysing both proteomic and transcriptomic data.

Methods that generate large 'omics' datasets are increasingly being used in the area of rheumatology.
Year: 2015 PMID: 25923811 PMCID: PMC4315157 DOI: 10.1186/1471-2164-16-S1-S2
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Example of a rule set generated by BioHEL. BioHEL generates rule sets to classify samples. The combination of rules in a rule set is used to assign samples to their respective treatment groups. Each rule contains one or more genes, each with an expression value that the gene should be either above or below, depending on the rule. At the end of each line is the group to which the rule relates. For example, the first rule of the rule set shown assigns a sample to the OA class if the value of the gene attribute 207211_at is greater than 100.
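To make the caption concrete, the following is a minimal sketch of how a threshold-based rule set could be applied to a sample. It is not BioHEL's own rule format or code. The tuple representation, the first-match evaluation order, the default class, and the probe ID `probe_B` are all assumptions made for illustration; only the condition `207211_at > 100 -> OA` comes from the caption.

```python
from typing import Dict, List, Optional, Tuple

# A condition is (attribute, operator, threshold); operator is ">" or "<".
Condition = Tuple[str, str, float]
# A rule is a list of conditions (all must hold) plus the class it predicts.
Rule = Tuple[List[Condition], str]


def rule_matches(conditions: List[Condition], sample: Dict[str, float]) -> bool:
    """Return True when every condition of the rule holds for the sample."""
    for attribute, operator, threshold in conditions:
        value = sample[attribute]
        if operator == ">" and not value > threshold:
            return False
        if operator == "<" and not value < threshold:
            return False
    return True


def classify(rules: List[Rule], sample: Dict[str, float],
             default: Optional[str] = None) -> Optional[str]:
    """Apply rules in order; return the class of the first rule that matches."""
    for conditions, label in rules:
        if rule_matches(conditions, sample):
            return label
    return default  # assumed fallback when no rule fires


# Hypothetical rule set: only the first rule is taken from the caption.
# The second rule, "probe_B", and the default class are invented.
rules: List[Rule] = [
    ([("207211_at", ">", 100.0)], "OA"),
    ([("probe_B", "<", 50.0), ("207211_at", "<", 100.0)], "RA"),
]

print(classify(rules, {"207211_at": 150.0, "probe_B": 10.0}, default="Control"))  # OA
print(classify(rules, {"207211_at": 80.0, "probe_B": 10.0}, default="Control"))   # RA
print(classify(rules, {"207211_at": 80.0, "probe_B": 70.0}, default="Control"))   # Control
```

A rule with several conditions fires only when all of them hold, which matches the caption's description of rules containing one or more genes.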
Descriptions of datasets analysed, including both proteomic and transcriptomic. Those prefixed 'GSE' were from NCBI GEO and those prefixed 'E-GEOD' were from ArrayExpress.
| Dataset | No. of samples | No. of genes or proteins | No. of classes | Description |
|---|---|---|---|---|
| Canine proteomics | 23 | 178 | 4 | Articular cartilage dataset treated with IL-1β to stimulate inflammation. Some samples were also treated with carprofen, a non-steroidal anti-inflammatory drug. Other samples were treated with carprofen only or nothing, as a control. The emPAI dataset includes emPAI label-free quantitation to compare protein quantities across samples, for the proteins with Mascot scores above 30. The ProteinProphet dataset includes a probability for each protein identified in each sample indicating how likely it is to be present in the sample. |
| Canine proteomics | 23 | 1322 | 4 | |
| GSE36700 | 25 | 54675 | 5 | Comparison between gene expression in synovial biopsies from patients with OA, RA, Systemic Lupus Erythematosus (SLE), seronegative arthritis (SA), and microcrystalline arthritis (MIC). |
| GSE3698 | 48 | 17048 | 3 | Comparison between OA, RA and pigmented villonodular synovitis (VS), a rare group of lesions with morphological features suggesting an inflammatory as well as a neoplastic nature. All three diseases result in a progressive destruction of affected joints and remain a diagnostic difficulty because of nonspecific symptoms. Tissue samples obtained from knee surgery. |
| E-GEOD-12021 | 31 | 22284 | 3 | Gene expression variances were tested in synovial membrane samples of RA patients, OA patients, and normal controls. |
| E-GEOD-29746 | 31 | 44397 | 3 | Comparison of gene expression between two pathological groups of human synovial fibroblasts (SF) from RA and OA synovial tissues with normal SF from healthy individuals. |
| E-GEOD-27390 | 19 | 54675 | 2 | Gene expression profiling of bone marrow-derived mononuclear cells from patients with RA vs. OA. |
Figure 2. Workflow of the RGIFE heuristic. In each iteration, genes are removed from the dataset and are returned only if their removal lowers the classification accuracy. Leave-one-out cross-validation was used to assess the classification abilities of the models built.
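The remove-and-return loop in the workflow can be sketched as follows. This is a simplified illustration under stated assumptions, not the RGIFE implementation: a nearest-centroid classifier stands in for BioHEL, features are tried in index order rather than ranked by the rules, and the function names, `block_size` parameter, and toy data are hypothetical.

```python
from typing import List, Sequence, Tuple


def loocv_accuracy(X: Sequence[Sequence[float]], y: Sequence[str],
                   features: List[int]) -> float:
    """Leave-one-out accuracy of a nearest-centroid classifier on `features`."""
    correct = 0
    for held_out in range(len(X)):
        # Build per-class sums from every sample except the held-out one.
        sums, counts = {}, {}
        for i, (row, label) in enumerate(zip(X, y)):
            if i == held_out:
                continue
            acc = sums.setdefault(label, [0.0] * len(features))
            for k, f in enumerate(features):
                acc[k] += row[f]
            counts[label] = counts.get(label, 0) + 1

        def distance(label: str) -> float:
            # Squared Euclidean distance from the held-out sample to a centroid.
            return sum((X[held_out][f] - sums[label][k] / counts[label]) ** 2
                       for k, f in enumerate(features))

        predicted = min(sorted(sums), key=distance)
        correct += predicted == y[held_out]
    return correct / len(X)


def iterative_elimination(X: Sequence[Sequence[float]], y: Sequence[str],
                          block_size: int = 1) -> Tuple[List[int], float]:
    """Remove blocks of features; keep a removal only if accuracy does not drop."""
    selected = list(range(len(X[0])))
    best = loocv_accuracy(X, y, selected)
    candidates = list(selected)  # features whose removal has not been tried yet
    while candidates and len(selected) > 1:
        block, candidates = candidates[:block_size], candidates[block_size:]
        trial = [f for f in selected if f not in block]
        if not trial:
            continue
        score = loocv_accuracy(X, y, trial)
        if score >= best:
            selected, best = trial, score  # removal is kept
        # otherwise the block is "returned": `selected` stays unchanged
    return selected, best


# Toy data (invented): feature 0 separates the classes, features 1 and 2 do not.
X = [[0.0, 0, 0], [0.5, 1, 0], [1.0, 0, 1], [0.5, 1, 1],
     [10.0, 0, 0], [10.5, 1, 0], [11.0, 0, 1], [10.5, 1, 1]]
y = ["OA"] * 4 + ["RA"] * 4

print(iterative_elimination(X, y))  # ([0], 1.0)
```

In the toy run, removing feature 0 lowers the accuracy, so it is returned to the dataset, while features 1 and 2 are removed without any loss.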
Classification accuracies achieved using BioHEL with and without also using the RGIFE heuristic.
| Dataset | BioHEL, no feature reduction (TPR) | BioHEL, no feature reduction (TNR) | RGIFE+BioHEL (TPR) | RGIFE+BioHEL (TNR) |
|---|---|---|---|---|
| ProteinProphet | 0.74 | 0.90 | 0.91 | 0.97 |
| emPAI | 0.57 | 0.81 | 0.96 | 0.99 |
TPR and TNR achieved by BioHEL, with and without RGIFE, compared with the other methods for the five transcriptomic datasets, using leave-one-out cross-validation.
| Dataset | NaiveBayes TPR | NaiveBayes TNR | SVM TPR | SVM TNR | IBk TPR | IBk TNR | JRip TPR | JRip TNR | J48 TPR | J48 TNR | RandomForest TPR | RandomForest TNR | BioHEL TPR | BioHEL TNR | RGIFE+BioHEL TPR | RGIFE+BioHEL TNR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | 0.24 | 0.67 | 0.28 | 0.72 | 0.84 | 0.97 | 0.48 | 0.66 | 0.24 | 0.62 | 0.44 | 0.83 | 0.76 | 0.91 | 0.96 | 0.98 |
|  | 0.58 | 0.73 | 0.39 | 0.60 | 0.67 | 0.79 | 0.67 | 0.78 | 0.75 | 0.84 | 0.54 | 0.72 | 0.73 | 0.83 | 1.00 | 1.00 |
|  | 0.84 | 0.91 | 0.39 | 0.61 | 0.71 | 0.83 | 0.58 | 0.73 | 0.58 | 0.74 | 0.77 | 0.87 | 0.87 | 0.93 | 0.97 | 0.98 |
|  | 0.48 | 0.63 | 0.48 | 0.68 | 0.65 | 0.79 | 0.61 | 0.76 | 0.42 | 0.59 | 0.55 | 0.71 | 0.77 | 0.87 | 0.84 | 0.90 |
|  | 1.00 | 1.00 | 1.00 | 1.00 | 0.95 | 0.95 | 0.90 | 0.89 | 0.90 | 0.91 | 0.84 | 0.86 | 1.00 | 1.00 | 1.00 | 1.00 |
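For readers unfamiliar with the two metrics in the tables above, here is a small sketch of how a true positive rate (TPR) and true negative rate (TNR) can be computed for a multi-class problem. The source does not state how the per-class values were combined, so the one-vs-rest treatment and the unweighted (macro) average used here are assumptions, and the labels are invented.

```python
from typing import Dict, Sequence, Tuple


def per_class_rates(true_labels: Sequence[str],
                    predicted_labels: Sequence[str]) -> Dict[str, Tuple[float, float]]:
    """One-vs-rest (TPR, TNR) for every class that occurs in `true_labels`."""
    rates = {}
    for cls in sorted(set(true_labels)):
        tp = fn = tn = fp = 0
        for t, p in zip(true_labels, predicted_labels):
            if t == cls:
                if p == cls:
                    tp += 1
                else:
                    fn += 1
            else:
                if p == cls:
                    fp += 1
                else:
                    tn += 1
        tpr = tp / (tp + fn)  # tp + fn > 0 because cls occurs in true_labels
        tnr = tn / (tn + fp) if (tn + fp) else float("nan")
        rates[cls] = (tpr, tnr)
    return rates


def macro_average(rates: Dict[str, Tuple[float, float]]) -> Tuple[float, float]:
    """Unweighted mean of the per-class TPR and TNR (an assumed convention)."""
    n = len(rates)
    return (sum(r[0] for r in rates.values()) / n,
            sum(r[1] for r in rates.values()) / n)


# Invented predictions, e.g. collected over the folds of a cross-validation.
true_labels = ["OA", "OA", "RA", "RA", "Control", "Control"]
predicted = ["OA", "RA", "RA", "RA", "Control", "OA"]

rates = per_class_rates(true_labels, predicted)
print(rates["OA"])           # (0.5, 0.75)
print(macro_average(rates))
```

With leave-one-out cross-validation, each sample is predicted once by a model trained on all the others, and the pooled predictions are scored in this way.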
The proteins included in the reduced datasets identified by RGIFE for the canine proteomics emPAI and ProteinProphet data.
| Protein ID | Protein name | Identified from emPAI or ProteinProphet | Protein description | Known link to cartilage inflammation or OA |
|---|---|---|---|---|
| MMP-3 | matrix-metalloproteinase 3 | ProteinProphet and emPAI | MMP-3 is a proteolytic enzyme known to degrade components of the ECM, including collagens and cartilage proteoglycans. | Found to be down-regulated in late OA. |
| IL-8 | interleukin-8 | ProteinProphet | IL-8 is a chemotactic factor that attracts neutrophils, basophils, and T-cells, but not monocytes. It is also involved in neutrophil activation. It is released from several cell types in response to an inflammatory stimulus. | IL-8 is the major chemotactic factor released in response to proinflammatory cytokines in synovial tissues from RA and OA affected joints. |
| TSP1 | thrombospondin-1 | ProteinProphet | Adhesive glycoprotein that mediates cell-to-cell and cell-to-matrix interactions. | Levels of TSP1 are increased after the onset of OA. |
| APOE | apolipoprotein E | ProteinProphet | APOE mediates the binding, internalization, and catabolism of lipoprotein particles. | No known link. |
| HPLN1 | hyaluronan and proteoglycan link protein 1 | ProteinProphet | Stabilizes the aggregates of proteoglycan monomers with hyaluronic acid in the extracellular cartilage matrix. | HPLN1 has been associated with OA and osteophyte formation. |
| TPIS | triosephosphate isomerase | ProteinProphet | Catalyses the reaction D-glyceraldehyde 3-phosphate = glycerone phosphate. | No known link. |
| CLUS | clusterin | emPAI | A glycoprotein that functions as an extracellular chaperone, preventing aggregation of non-native proteins; it is involved in many diverse biological functions. | Higher levels of clusterin have been observed in synovial fluid of advanced primary knee and hip OA patients. |
| FETUA | alpha-2-HS-glycoprotein/fetuin-A | emPAI | Promotes endocytosis, possesses opsonic properties and influences the mineral phase of bone. | FETUA levels have been found to decrease as the severity of knee OA increases. |
| POLG | Genome polyprotein | emPAI | Bacterial protein. | No known link. |
| ATPX | ATP synthase subunit b' | emPAI | Bacterial protein. | No known link. |
The number of genes present in each dataset before and after feature reduction with RGIFE
| Dataset | No. of gene identifiers (whole dataset) | No. of gene identifiers (reduced dataset) |
|---|---|---|
| GSE36700 | 54675 | 24 |
| GSE3698 | 17048 | 19 |
| E-GEOD-12021 | 22284 | 5 |
| E-GEOD-29746 | 44397 | 669 |
| E-GEOD-27390 | 54675 | 14 |
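The scale of the reduction is easier to see as a percentage of identifiers retained. The short calculation below uses only the counts from the table above; the variable and function names are illustrative.

```python
# (identifiers before, identifiers after) per dataset, copied from the table.
gene_counts = {
    "GSE36700": (54675, 24),
    "GSE3698": (17048, 19),
    "E-GEOD-12021": (22284, 5),
    "E-GEOD-29746": (44397, 669),
    "E-GEOD-27390": (54675, 14),
}


def percent_retained(before: int, after: int) -> float:
    """Share of the original identifiers that remain after reduction, in percent."""
    return 100.0 * after / before


retained = {name: percent_retained(b, a) for name, (b, a) in gene_counts.items()}
for name, pct in sorted(retained.items(), key=lambda item: item[1]):
    print(f"{name}: {pct:.3f}% retained")
```

Every dataset keeps well under 2% of its identifiers. E-GEOD-29746 retains the largest share (about 1.5%) and E-GEOD-12021 the smallest.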
Comparison of feature selection (FS) methods applied to both transcriptomics and proteomics, for each combination of classifier and dataset. The number of times each FS method resulted in the highest TPR and the lowest TPR is shown.
| Method | No. of times the method results in the highest TPR | No. of times the method results in the lowest TPR |
|---|---|---|
| CFS | 3 | 15 |
| Chisquared | 2 | 4 |
| NaiveBayes - FS | 6 | 6 |
| Random Forest - FS | 10 | 6 |
| RGIFE | 13 | 5 |
| SVM RFE | 16 | 13 |
Comparison of FS methods applied to both transcriptomics and proteomics, for each combination of classifier and dataset. The number of times each FS method resulted in the highest TNR and the lowest TNR are shown
| Method | No. of times the method results in the highest TNR | No. of times the method results in the lowest TNR |
|---|---|---|
| CFS | 3 | 14 |
| Chisquared | 3 | 4 |
| NaiveBayes - FS | 6 | 7 |
| Random Forest - FS | 10 | 7 |
| RGIFE | 12 | 5 |
| SVM RFE | 15 | 12 |
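Counts like those in the two comparison tables can be produced by tallying, for each classifier and dataset combination, which method achieved the best and the worst score. The sketch below shows one way to do this. The score values and dataset names are invented, and counting every tied method is an assumption, since the source does not say how ties were handled.

```python
from collections import Counter
from typing import Dict, Tuple


def tally_extremes(results: Dict[Tuple[str, str], Dict[str, float]]
                   ) -> Tuple[Counter, Counter]:
    """Count how often each FS method gives the highest and the lowest score.

    `results` maps (dataset, classifier) to {FS method: score}. Methods that
    tie for an extreme are all counted (an assumption made for this sketch).
    """
    highest, lowest = Counter(), Counter()
    for scores in results.values():
        top, bottom = max(scores.values()), min(scores.values())
        for method, score in scores.items():
            if score == top:
                highest[method] += 1
            if score == bottom:
                lowest[method] += 1
    return highest, lowest


# Invented TPR values for three classifier/dataset combinations.
results = {
    ("dataset_A", "NaiveBayes"): {"CFS": 0.6, "RGIFE": 0.9, "SVM RFE": 0.9},
    ("dataset_A", "SVM"): {"CFS": 0.5, "RGIFE": 0.8, "SVM RFE": 0.7},
    ("dataset_B", "NaiveBayes"): {"CFS": 0.7, "RGIFE": 0.6, "SVM RFE": 0.9},
}

highest, lowest = tally_extremes(results)
print(dict(highest))  # {'RGIFE': 2, 'SVM RFE': 2}
print(dict(lowest))   # {'CFS': 2, 'RGIFE': 1}
```

Because tied methods are all counted, the totals in a column can exceed the number of classifier and dataset combinations.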