Feng-Biao Guo, Chuan Dong, Hong-Li Hua, Shuo Liu, Hao Luo, Hong-Wan Zhang, Yan-Ting Jin, Kai-Yue Zhang.
Abstract
MOTIVATION: Previously constructed classifiers for predicting eukaryotic essential genes integrated a variety of features, including experimental ones. Satisfactory prediction from nucleotide sequence information alone would be more promising. Three groups recently identified essential genes in human cancer cell lines using wet-lab experiments, which provided an excellent opportunity to test this idea. Here we extended the Z curve method into a λ-interval form to represent nucleotide composition and association information, and used it to construct an SVM classification model.
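The abstract does not spell out the λ-interval formulas, so the following is only a minimal sketch of how phase-specific Z curve features could be computed. The three components are the standard Z curve transform (purine/pyrimidine, amino/keto, weak/strong hydrogen bond). The function names, the normalization, and the reading of λ as the number of bases skipped between the two members of a pair are assumptions made for illustration, not the authors' published definition.

```python
from collections import Counter

BASES = "ACGT"

def z_components(freq):
    """Standard Z curve transform of four base frequencies."""
    a, c, g, t = (freq.get(b, 0.0) for b in BASES)
    x = (a + g) - (c + t)  # purine vs. pyrimidine
    y = (a + c) - (g + t)  # amino vs. keto
    z = (a + t) - (g + c)  # weak vs. strong hydrogen bond
    return x, y, z

def phase_mono_features(seq):
    """3 codon phases x 3 components = 9 features (the w = 1 case)."""
    seq = seq.upper()
    feats = []
    for phase in range(3):
        bases = [b for b in seq[phase::3] if b in BASES]
        counts = Counter(bases)
        total = len(bases) or 1  # avoid division by zero on short input
        feats.extend(z_components({b: counts[b] / total for b in BASES}))
    return feats

def phase_interval_di_features(seq, lam=0):
    """3 phases x 4 leading bases x 3 components = 36 features.

    Assumption: a pair is (seq[i], seq[i + lam + 1]), i.e. lam bases
    are skipped between the leading base and its partner.
    """
    seq = seq.upper()
    feats = []
    for phase in range(3):
        pairs = Counter()
        total = 0
        for i in range(phase, len(seq) - lam - 1, 3):
            lead, follow = seq[i], seq[i + lam + 1]
            if lead in BASES and follow in BASES:
                pairs[lead + follow] += 1
                total += 1
        total = total or 1
        for lead in BASES:
            freq = {b: pairs[lead + b] / total for b in BASES}
            feats.extend(z_components(freq))
    return feats
```

Calling `phase_mono_features(orf)` yields 9 values and `phase_interval_di_features(orf, lam=1)` yields 36. Vectors of this kind would then be concatenated across settings and passed to an SVM.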
Year: 2017 PMID: 28158612 PMCID: PMC7110051 DOI: 10.1093/bioinformatics/btx055
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Description of the construction of the human essential and non-essential gene datasets
Fig. 2. Description of the process of constructing the oligonucleotides using the λ-interval Z curve method. A gene or an ORF has three phases (I, II and III represent the first, second and third phase, respectively), denoted with different colors. (Color version of this figure is available at Bioinformatics online.)
AUC values at different w, λ and penalty parameters c
| Variables | c | AUC |
|---|---|---|
| | 256 | 0.8002 |
| | 64 | 0.8198 |
| | 16 | 0.8256 |
| | 128 | 0.8275 |
| | 64 | 0.8293 |
| | 64 | 0.8365 |
| | 32 | 0.8276 |
| | 1 | 0.8333 |
| | 0.5 | 0.8347 |
| | 0.25 | 0.8356 |
| | 0.5 | 0.8408 |
| | 0.25 | 0.8429 |
| | 1 | 0.8344 |
| | 0.5 | 0.8369 |
| | 0.5 | 0.8386 |
| | 0.25 | 0.8413 |
| | 0.25 | 0.8436 |
| | 0.25 | 0.8449 |
The subscripts of the variables correspond to w and λ.
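For readers less familiar with the AUC metric reported in the table, the sketch below computes it directly from its definition: the probability that a randomly chosen positive (essential) example receives a higher classifier score than a randomly chosen negative one, with ties counted as one half. The function name and the toy scores are illustrative assumptions. The excerpt does not say how the authors computed AUC.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (O(P*N))."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Toy example: hypothetical decision scores for four genes.
toy_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

In a parameter sweep such as the one tabulated above, a value like this would be computed once per (w, λ, c) setting on held-out data and the settings compared.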
Fig. 3. AUC values obtained under different top n features and the contribution of each group. (A) The AUC scores under different top features. Dots with different colors denote different c values. (B) The selective tendentiousness for every variable type. The red bars denote that the selective tendentiousness is statistically significant in the hypergeometric distribution test. (Color version of this figure is available at Bioinformatics online.)
Feature details for every variable type in the top 800 selective features
| Variables | No. (A) | No. (E) | P(A) | P(E) | P(A)/P(E) − 1 | P value | AUC | C |
|---|---|---|---|---|---|---|---|---|
| | 3 | 9 | 0.0038 | 0.0020 | 0.8938 | 0.200495 | – | – |
| | 7 | 36 | 0.0088 | 0.0079 | 0.1047 | 0.452683 | 0.7220 | 1024 |
| | 3 | 36 | 0.0038 | 0.0079 | −0.5266 | 0.965324 | 0.6216 | 64 |
| | 4 | 36 | 0.0050 | 0.0079 | −0.3688 | 0.900314 | 0.6028 | 1024 |
| | 4 | 36 | 0.0050 | 0.0079 | −0.3688 | 0.900314 | 0.6018 | 128 |
| | 8 | 36 | 0.0100 | 0.0079 | 0.2625 | 0.292801 | 0.6302 | 256 |
| | 20 | 144 | 0.0250 | 0.0317 | −0.2109 | 0.906339 | 0.6841 | 1024 |
| | 25 | 144 | 0.0313 | 0.0317 | −0.0137 | 0.566007 | 0.6633 | 512 |
| | 22 | 144 | 0.0275 | 0.0317 | −0.1320 | 0.802176 | 0.6856 | 1024 |
| | 25 | 144 | 0.0313 | 0.0317 | −0.0137 | 0.566007 | 0.6816 | 1024 |
| | 94 | 576 | 0.1175 | 0.1267 | −0.0729 | 0.821727 | 0.7652 | 8 |
| | 84 | 576 | 0.1050 | 0.1267 | −0.1715 | 0.983389 | 0.7125 | 32 |
| | 107 | 576 | 0.1338 | 0.1267 | 0.0554 | 0.272681 | 0.7168 | 1024 |
| | 86 | 576 | 0.1075 | 0.1267 | −0.1518 | 0.970308 | 0.7027 | 64 |
| | 103 | 576 | 0.1288 | 0.1267 | 0.0159 | 0.44446 | 0.6983 | 1024 |
| | 107 | 576 | 0.1338 | 0.1267 | 0.0554 | 0.272681 | – | – |
Notes: The feature groups that are statistically significant in the hypergeometric distribution test are indicated in bold font.
No. (A): number of features of the variable type in the top 800 features; No. (E): number of features of the variable type in the original feature vector FV.
P(A): actual frequency of the variable type in the top 800 features, P(A) = No.(A)/800; P(E): expected frequency of the variable type in FV, P(E) = No.(E)/4545.
P value: P value of the hypergeometric distribution test.
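The quantities defined in these notes can be reproduced with a few lines of arithmetic. The sketch below uses only the Python standard library and assumes that the hypergeometric test is the upper tail P(X ≥ No.(A)) when 800 features are drawn without replacement from 4545. The notes do not state which tail was used, so that choice, along with the function names, is an assumption.

```python
from math import comb

def enrichment(no_a, no_e, top_n=800, total=4545):
    """Return P(A), P(E) and P(A)/P(E) - 1 as defined in the table notes."""
    p_a = no_a / top_n
    p_e = no_e / total
    return p_a, p_e, p_a / p_e - 1

def hypergeom_upper_tail(k, group_size, draws, population):
    """P(X >= k): at least k of `draws` items fall in a group of `group_size`.

    Assumed form of the test; computed exactly with integer binomials.
    """
    denom = comb(population, draws)
    numer = sum(
        comb(group_size, j) * comb(population - group_size, draws - j)
        for j in range(k, min(group_size, draws) + 1)
    )
    return numer / denom

# First table row: No.(A) = 3, No.(E) = 9.
p_a, p_e, rel = enrichment(3, 9)
p_value = hypergeom_upper_tail(3, 9, 800, 4545)
```

The results can be compared against the first table row, which reports P(A)/P(E) − 1 = 0.8938 and a P value of 0.200495.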