Lin Sun, Xiaoyu Zhang, Jiucheng Xu, Wei Wang, Ruonan Liu.
Abstract
In recent years, tumor classification based on gene expression profiles has drawn great attention, and the related research results have been widely applied to the clinical diagnosis of major genetic diseases. These studies are of great importance for accurate cancer diagnosis and subtype recognition. However, microarray gene expression data are characterized by small sample sizes, high dimensionality, heavy noise, and data redundancy. To further improve the classification performance on microarray data, a gene selection approach based on the Fisher linear discriminant (FLD) and the neighborhood rough set (NRS) is proposed. First, the FLD method is employed to preliminarily reduce the gene data and obtain features with strong classification ability, which form a candidate gene subset. Then, neighborhood precision and neighborhood roughness are defined in a neighborhood decision system, and calculation approaches for neighborhood dependency and the significance of an attribute are given. On this basis, a reduction model of neighborhood decision systems is presented, and a gene selection algorithm based on FLD and NRS is proposed. Finally, four public gene data sets are used in the simulation experiments. Experimental results under the SVM classifier demonstrate that the proposed algorithm is effective: it selects a smaller gene subset with stronger classification ability and obtains better classification performance.
Keywords: Fisher linear discriminant; Gene selection; neighborhood rough set; reduction
Year: 2017 PMID: 29161975 PMCID: PMC5972918 DOI: 10.1080/21655979.2017.1403678
Source DB: PubMed Journal: Bioengineered ISSN: 2165-5979 Impact factor: 3.269
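The FLD-based preliminary reduction described in the abstract can be illustrated with the widely used per-gene Fisher criterion, which scores a gene by the squared difference of its class means divided by the sum of its class variances. The abstract does not state the exact formula the authors use, so this formulation, the function names `fisher_scores` and `top_k_genes`, and the cutoff `k` are assumptions made for illustration only.

```python
from statistics import mean, pvariance


def fisher_scores(samples, labels):
    """Per-gene Fisher criterion for a two-class problem (illustrative).

    samples: list of rows, one row of expression values per sample.
    labels:  list of class labels, one per sample (exactly two classes).
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    scores = []
    for j in range(len(samples[0])):
        a = [row[j] for row, y in zip(samples, labels) if y == classes[0]]
        b = [row[j] for row, y in zip(samples, labels) if y == classes[1]]
        between = (mean(a) - mean(b)) ** 2      # separation of class means
        within = pvariance(a) + pvariance(b)    # spread inside each class
        if within > 0:
            scores.append(between / within)
        else:
            # Zero spread: perfectly separating if the means differ.
            scores.append(float("inf") if between > 0 else 0.0)
    return scores


def top_k_genes(samples, labels, k):
    """Indices of the k highest-scoring genes (the candidate subset)."""
    scores = fisher_scores(samples, labels)
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]
```

Under these assumptions, calling `top_k_genes(samples, labels, k)` on an expression matrix would return a candidate gene subset that a later rough-set reduction could shrink further.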
Description of the four experimental data sets.
| Data set | Feature size | Sample size (normal/tumor) | Class size |
|---|---|---|---|
| Colon | 2000 | 62 (40/20) | 2 |
| Leukemia | 7129 | 72 (25/47) | 2 |
| Lung | 12533 | 181 (31/150) | 2 |
| Prostate | 12600 | 136 (77/59) | 2 |
Selected gene subsets of four data sets using FLD-NRS.
| Data set | Gene subset after reduction |
|---|---|
| Colon | {1423, 765, 822, 66, 1870, 590} |
| Leukemia | {1834, 2354, 2642, 1685, 758} |
| Lung | {2549, 7200, 6139} |
| Prostate | {8986, 11052, 6392, 4050} |
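Subsets like those in the table come from the reduction step, which the abstract describes in terms of neighborhood dependency and attribute significance. The sketch below follows the standard neighborhood rough set definitions: a sample belongs to the positive region if every sample within its neighborhood shares its class, dependency is the fraction of such samples, and the significance of a gene is the dependency gained by adding it. The Euclidean distance, the radius `delta`, the greedy forward search, and the function names are assumptions and not the paper's confirmed implementation.

```python
import math


def neighborhood_dependency(samples, labels, genes, delta):
    """Fraction of samples whose delta-neighborhood, measured on `genes`
    only, contains no sample of a different class."""
    if not genes:
        return 0.0
    consistent = 0
    for i, x in enumerate(samples):
        pure = True
        for k, y in enumerate(samples):
            dist = math.sqrt(sum((x[g] - y[g]) ** 2 for g in genes))
            if dist <= delta and labels[k] != labels[i]:
                pure = False
                break
        if pure:
            consistent += 1
    return consistent / len(samples)


def greedy_reduct(samples, labels, candidates, delta):
    """Forward selection: repeatedly add the gene with the largest
    significance (dependency gain) until no gene improves dependency."""
    selected = []
    current = 0.0
    remaining = list(candidates)
    while remaining:
        gains = [
            (neighborhood_dependency(samples, labels, selected + [g], delta)
             - current, g)
            for g in remaining
        ]
        best_gain, best_gene = max(gains, key=lambda t: t[0])
        if best_gain <= 0:
            break
        selected.append(best_gene)
        remaining.remove(best_gene)
        current += best_gain
    return selected
```

In this sketch `candidates` would be the gene indices that survive the FLD step, and expression values are assumed to be scaled to a comparable range so that a single radius `delta` is meaningful.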
Figure 1. Classification accuracy of the four data sets under different classifiers.
Selected gene number and classification accuracy of the four algorithms on different data sets.
| Data set | ODP genes | ODP accuracy | Lasso genes | Lasso accuracy | NRS genes | NRS accuracy | FLD-NRS genes | FLD-NRS accuracy |
|---|---|---|---|---|---|---|---|---|
| Colon | 2000 | 0.811 | 5 | 0.887 | 4 | 0.611 | 6 | 0.880 |
| Leukemia | 7129 | 0.944 | 23 | 0.986 | 5 | 0.645 | 5 | 0.828 |
| Lung | 12412 | 0.903 | 8 | 0.995 | 3 | 0.641 | 3 | 0.889 |
| Prostate | 12206 | 0.619 | 63 | 0.961 | 4 | 0.647 | 4 | 0.800 |
| Time complexity | - | | | | | | | |
Selected gene number and classification accuracy of three algorithms on different data sets.
| Data set | RF genes | RF accuracy | SNRRF genes | SNRRF accuracy | FLD-NRS genes | FLD-NRS accuracy |
|---|---|---|---|---|---|---|
| Colon | 2000 | 0.848 | 72 | 0.875 | 6 | 0.880 |
| Leukemia | 7129 | 0.902 | 26 | 0.948 | 5 | 0.828 |
| Lung | 2880 | 0.864 | 10 | 0.899 | 3 | 0.889 |
| Prostate | 12600 | 0.925 | 49 | 0.931 | 4 | 0.800 |
| Time complexity | | | | | | |