Zichuan Fan, Fanchen Kong, Yang Zhou, Yiqing Chen, Yalan Dai.
Abstract
Mass spectrometry (MS) is an important technique in protein research. Effective methods for classifying MS data could contribute to earlier and less invasive diagnosis and could also facilitate developments in bioinformatics. Because MS data are high-dimensional, methods that can deal effectively with large amounts of MS data have been widely studied. This paper investigates applications of methods based on intelligent algorithms. First, classification and biomarker analysis methods that use typical machine learning approaches are discussed, followed by ensemble strategy algorithms. Simple, basic machine learning algorithms alone can hardly address the varied needs of protein MS classification. Preprocessing algorithms are also studied, because they are useful for feature selection or feature extraction and can thereby improve classification performance. As protein MS data grow in volume and complexity, further work should emphasize improving classification methods through classifier selection and through combinations of different algorithms with preprocessing algorithms.
Year: 2018 PMID: 30534555 PMCID: PMC6252195 DOI: 10.1155/2018/2862458
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Typical classification algorithms, their characteristics, and the samples they were applied to.
| Method | Feature | Samples |
|---|---|---|
| Logistic Regression | can predict peak intensity patterns exactly and simplify an SVD decomposition | Tandem mass spectrometry |
| KNN algorithm | by Euclidean distance or by one minus correlation | ovarian cancer MALDI-MS database |
| | a modification of the Euclidean distance formula | patients with mild cognitive impairment and patients with clinical symptoms of Alzheimer's disease |
| Support vector machines | using 4 genes | colon cancer database |
| | suitable for noisy high-throughput proteomics and microarray data, with superior robustness to noise | SELDI-TOF-MS |
| | an unsupervised feature selection phase, restriction of the coefficient of variation, and wavelet analysis for classification | ovarian cancer database |
| Decision tree algorithm | a new high-throughput proteomic classification system, developed from a nine-protein mass pattern | blood samples from prostate cancer patients and a healthy male cohort |
| Classification tree | partitions the learning sample into smaller and smaller subsamples so that the disease status within each subsample is relatively homogeneous | clinical specimens |
| | combines MALDI-TOF MS with WCX magnetic beads, with high sensitivity (98.3%) and high specificity (84.4%) | patients with pulmonary tuberculosis |
| | boosted feature extraction coupled with the nearest centroid classifier, with high accuracy | OCWCX2a |
| Random Forest | used as both feature extractor and classifier; suited to small samples | serum samples from patients with ovarian cancer |
| | a complex proteome with a wide range of protein concentrations | signature peptides |
| | nonlinear random and combined with a discrete mapping approach | phosphorylation data set |
| Neural Networks algorithm | a multilayer perceptron ANN with a backpropagation algorithm | SELDI-MS data |
| | using Naive Bayes with a multilayer perceptron | mass data set with InfoGain and Relief-F |
| | based on SRNG and FLSOM | breast cancer, listeria, and tissue data sets |
| | convolutional neural networks | Q-TOF and IT |
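The KNN row lists two distance choices, Euclidean distance and one minus correlation. The following is a minimal sketch of how the two could plug into the same nearest-neighbour vote; it is an illustration, not the implementation used in the cited studies. The function names, the toy intensity vectors, and the labels `"case"` and `"control"` are all made up for this example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length intensity vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def one_minus_correlation(a, b):
    """Distance defined as 1 - r, where r is the Pearson correlation."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # Assumption: a flat spectrum has no defined correlation,
        # so it is treated here as uncorrelated (distance 1).
        return 1.0
    return 1.0 - cov / math.sqrt(var_a * var_b)


def knn_predict(train_x, train_y, query, k=3, distance=euclidean):
    """Majority vote among the k training spectra closest to the query."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical peak intensities at three m/z positions (not real data).
train_x = [
    [1.0, 5.0, 1.0], [1.2, 4.8, 0.9], [0.9, 5.2, 1.1],
    [5.0, 1.0, 4.0], [4.8, 1.1, 4.2], [5.1, 0.9, 3.9],
]
train_y = ["control", "control", "control", "case", "case", "case"]
query = [4.9, 1.2, 4.1]

label_euclidean = knn_predict(train_x, train_y, query, k=3)
label_correlation = knn_predict(train_x, train_y, query, k=3,
                                distance=one_minus_correlation)
```

One minus correlation compares the shape of two spectra and ignores overall scale, whereas Euclidean distance is sensitive to absolute intensity, so the choice matters when spectra are not normalized to a common scale.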
Biomarker analysis algorithms and their advantages as well as disadvantages.
| Method | Advantages | Disadvantages | Samples |
|---|---|---|---|
| Support Vector Machine | High robustness to noise and good ability to recover informative features; can work well on nonlinear problems | Inferior in terms of the number of recovered informative genes; must rely on the collaborative information of multiple genes; hard to train; hard to find a kernel function | Noisy high-throughput proteomics and microarray data set |
| Decision Tree | Easy to interpret; nonparametric method | May get stuck in local minima; overfits data; cannot be learned online | Volatile oils and S. mutans |
| Neural Networks algorithm | Identifies masses that accurately predict tumour grade; high cross-validated sensitivity and specificity on test data | Needs a huge volume of samples; computationally expensive to train; black-box model; overfitting; hard to select meta-parameters | Astrocytoma |
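Sensitivity and specificity are the figures of merit quoted for several of the classifiers above. As a reminder of how they are derived from predictions, here is a small sketch. The function name, the labels, and the example predictions are invented for illustration and do not come from any of the cited data sets.

```python
def sensitivity_specificity(y_true, y_pred, positive="case"):
    """Return (sensitivity, specificity) for a binary labelling.

    sensitivity = TP / (TP + FN), the fraction of true positives detected
    specificity = TN / (TN + FP), the fraction of true negatives detected
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical labels: 4 diseased and 5 healthy samples.
y_true = ["case"] * 4 + ["control"] * 5
y_pred = ["case", "case", "case", "control",
          "control", "control", "control", "control", "case"]

sens, spec = sensitivity_specificity(y_true, y_pred)
```

In this toy example three of four cases are detected and four of five controls are correctly rejected, giving a sensitivity of 0.75 and a specificity of 0.8.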
Traditional preprocessing algorithms for data mining classification.
| Method | Feature | Samples |
|---|---|---|
| Wavelet algorithm | can capture localized features and keep the time property | Ovarian cancer identification |
| Genetic algorithm | discriminates VISA and VSSA strains by using the peaks that met the selection criteria described | MALDI-TOF MS |
| | combines genetic algorithm and cluster analysis methods | NOCEDP |
| | five peptides/proteins from the training group used for classification, with each peptide peak selected and software used to determine the optimal separation planes | serum samples from SCLC patients |
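The wavelet row notes that wavelets capture localized features while keeping positional (time) information. A one-level Haar transform is perhaps the simplest way to see this. The sketch below assumes the Haar wavelet purely for illustration, since the table does not say which wavelet family the cited work used, and the spectrum values are made up.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform.

    Returns (approximation, detail) coefficient lists, each half the
    length of the input. Requires an even-length input.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s
              for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s
              for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_step, recovering the original samples."""
    s = math.sqrt(2.0)
    signal = []
    for a, d in zip(approx, detail):
        signal.append((a + d) / s)
        signal.append((a - d) / s)
    return signal


# Hypothetical spectrum: a single sharp peak on a flat baseline.
spectrum = [0.0, 0.0, 0.0, 0.0, 8.0, 0.0, 0.0, 0.0]
approx, detail = haar_step(spectrum)

# Indices of non-zero detail coefficients show where the peak sits.
peak_positions = [i for i, d in enumerate(detail) if abs(d) > 1e-12]
reconstructed = haar_inverse(approx, detail)
```

Only the detail coefficient covering samples 4 and 5 is non-zero, so the peak's position survives the transform, and the inverse step recovers the original spectrum up to floating-point error.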