Sebastian Racedo, Ivan Portnoy, Jorge I Vélez, Homero San-Juan-Vergara, Marco Sanjuan, Eduardo Zurek.
Abstract
BACKGROUND: High-throughput sequencing enables the analysis of the composition of numerous biological systems, such as microbial communities. Identifying dependencies within these systems requires analyzing the underlying interaction patterns among all the variables that make up the system. This task is challenging because data from DNA-sequencing experiments are compositional: traditional interaction metrics (e.g., correlation) produce unreliable results when applied to relative fractions instead of absolute abundances. The compositionality-associated challenges extend to the classification task, which usually involves characterizing the interactions between the principal descriptive variables of the datasets. Classifying new samples/patients into binary categories corresponding to dissimilar biological settings or phenotypes (e.g., controls and cases) could help researchers develop treatments/drugs.
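The unreliability of correlation on relative fractions can be illustrated with a minimal sketch. It is not taken from the paper: the three taxa, their uniform abundance range, the sample count, and the seed are all hypothetical choices made for illustration. The abundances are drawn independently, so any correlation that appears after rescaling each sample to fractions is an artifact of the constant-sum constraint.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def close(sample):
    """Closure: rescale absolute abundances to fractions that sum to 1."""
    total = sum(sample)
    return [v / total for v in sample]

rng = random.Random(0)  # fixed seed (assumption) so the run is repeatable
n_samples = 2000        # hypothetical number of samples

# Hypothetical absolute abundances of three taxa, drawn independently.
absolute = [[rng.uniform(50, 150) for _ in range(3)] for _ in range(n_samples)]
relative = [close(s) for s in absolute]

# Correlation between taxon 0 and taxon 1, before and after closure.
r_abs = pearson([s[0] for s in absolute], [s[1] for s in absolute])
r_rel = pearson([s[0] for s in relative], [s[1] for s in relative])

print(f"absolute abundances: r = {r_abs:+.3f}")
print(f"relative fractions:  r = {r_rel:+.3f}")
```

With three exchangeable parts, the fractions are pushed toward a negative correlation even though the underlying abundances are unrelated, which is the kind of distortion the abstract refers to.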
Keywords: 16S rRNA sequencing; Classification method; Compositional nature; Microbial communities
Year: 2021 PMID: 34243809 PMCID: PMC8268467 DOI: 10.1186/s13040-021-00266-7
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Fig. 1 Evolution of the number of publications per year from 2009 to 2019
Fig. 2 Scheme of gene analysis used for sample classification
Summary of feature selection approaches in gene expression analysis
| Category | Description | Weaknesses | References |
|---|---|---|---|
| Filter | Extract features from the data without any type of learning involved. | Ignore interaction with the classifier. | [ |
| Wrapper | Use learning approaches to evaluate which features are useful. | Risk of overfitting; classifier-dependent selection. | [ |
| Embedded | Combine the traditional feature selection step with the classifier construction. | Classifier-dependent selection. | [ |
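As a rough illustration of the first row of the table, the sketch below ranks features by a statistic computed from the data alone, with no classifier involved. The variance criterion, the function name `filter_select`, and the toy matrix are assumptions made for this example and are not drawn from the paper.

```python
from statistics import pvariance

def filter_select(samples, k):
    """Return the indices of the k features with the highest variance.

    No classifier is consulted at any point, so the selection cannot
    account for how the features interact with a downstream classifier.
    """
    n_features = len(samples[0])
    scores = [pvariance([row[j] for row in samples]) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

# Hypothetical toy expression matrix: rows are samples, columns are features.
toy = [
    [1.0, 5.0, 0.2],
    [1.1, 9.0, 0.9],
    [0.9, 1.0, 0.1],
    [1.0, 7.0, 0.8],
]

selected = filter_select(toy, 2)
print(selected)
```

A selection that instead scored candidate feature subsets by training a classifier on each would tie the result to that classifier and add a risk of overfitting, which are the weaknesses listed in the second row.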
Summary of classifiers used in gene expression analysis
| Category | Classifier | References |
|---|---|---|
| | Probabilistic: Bayesian classifier, probabilistic linear discriminant analysis. Non-probabilistic: Support Vector Machine (SVM), SVM-RFE, nearest-neighbor (NN), linear discriminant analysis. | [ |
| | Fuzzy logic, genetic algorithms, classification and regression trees. | [ |
| | LogitBoost, AdaBoost.M1, GradientBoosting (GrBoost). | [ |
Fig. 3 Two-dimensional representation of datasets and X (a) without pretreatment and (b) after the pretreatment, along with the eigenvectors scaled by the corresponding eigenvalues
Fig. 4 Illustration of new samples and the line that separates both groups with the proposed method. Samples lying in the upper half-plane are classified in the case (v) group; all other samples are classified in the control (c) group
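The decision rule described in the Fig. 4 caption can be sketched as follows. This is only an illustration under stated assumptions: the line's slope and intercept and the sample coordinates are hypothetical, and how the proposed method actually derives the separating line is not shown in this record. Treating points exactly on the line as controls follows from the caption's "otherwise" wording.

```python
def classify(point, slope, intercept):
    """Label a 2-D sample by its side of the line y = slope * x + intercept."""
    x, y = point
    # Strictly above the line -> case (v); on or below the line -> control (c).
    return "case" if y > slope * x + intercept else "control"

# Hypothetical separating line and new samples, chosen only for illustration.
slope, intercept = 0.5, 0.0
new_samples = [(0.0, 1.0), (2.0, 0.5), (-1.0, -0.5)]

labels = [classify(p, slope, intercept) for p in new_samples]
print(labels)
```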
Performance of the proposed method for synthetic datasets. Configurations (n, m) not reported showed 100% classification accuracy.
| Sample size (n) | Number of features (m) | Classification accuracy (%) |
|---|---|---|
| 80 | 40 | 99.8 |
| 100 | 20 | 98.1 |
| 120 | 20 | 99.7 |
| 160 | 20 | 98.0 |
Classification accuracy for each method for the AGP and GG data sets
| Dataset | SVM | SVM-RFE | Proposed Method |
|---|---|---|---|
| | 92.03% | 96.33% | 95.06% |
| | 89.34% | 92% | 94% |
Fig. 5 Illustration of new samples and the line that separates both groups with the proposed method for the AGP (left) and GG (right) data sets