Zhi Hua Liu, Dian Jiao, Xiao Sun.
Abstract
Traditional sequence analysis depends on sequence alignment. In this study, we analyzed various functional regions of the human genome based on sequence features, including word frequency, dinucleotide relative abundance, and base-base correlation. We analyzed human chromosome 22 and classified the upstream, exon, intron, downstream, and intergenic regions by principal component analysis and discriminant analysis of these features. The results show that the functional regions of the genome can be classified on the basis of sequence features and discriminant analysis.
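The abstract names its features without giving formulas. As a rough illustration only, the sketch below computes two of them, word (k-mer) frequency and dinucleotide relative abundance, using commonly used definitions (overlapping words, and rho_xy = f_xy / (f_x * f_y) on a single strand). These definitions, the function names, and the handling of ambiguous bases are assumptions and may differ from what the authors actually used.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def word_frequencies(seq, k):
    """Relative frequency of every overlapping k-letter word over A, C, G, T.

    Words containing other symbols (for example N) are ignored, which is an
    assumption of this sketch rather than something stated in the source.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product(BASES, repeat=k)]
    total = sum(counts[w] for w in words)
    return {w: (counts[w] / total if total else 0.0) for w in words}


def dinucleotide_relative_abundance(seq):
    """Odds ratio rho_xy = f_xy / (f_x * f_y) for all 16 dinucleotides."""
    f1 = word_frequencies(seq, 1)
    f2 = word_frequencies(seq, 2)
    rho = {}
    for xy, f_xy in f2.items():
        expected = f1[xy[0]] * f1[xy[1]]
        # Report 0.0 when a base is absent and the ratio is undefined.
        rho[xy] = f_xy / expected if expected else 0.0
    return rho
```

Concatenating such values over several word lengths gives one feature vector per region, which is the kind of input a principal component or discriminant analysis would take.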
Year: 2005 PMID: 16689686 PMCID: PMC5172532 DOI: 10.1016/s1672-0229(05)03027-5
Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691
The Result of Principal Component Analysis
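| Component | Initial eigenvalue: total | Initial eigenvalue: variance (%) | Initial eigenvalue: cumulative (%) | Extraction sum of squared loadings: total | Extraction sum of squared loadings: variance (%) | Extraction sum of squared loadings: cumulative (%) |
|---|---|---|---|---|---|---|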
| 1 | 31.128 | 27.793 | 27.793 | 31.128 | 27.793 | 27.793 |
| 2 | 12.589 | 11.240 | 39.033 | 12.589 | 11.240 | 39.033 |
| 3 | 8.365 | 7.469 | 46.503 | 8.365 | 7.469 | 46.503 |
| 4 | 8.075 | 7.210 | 53.713 | 8.075 | 7.210 | 53.713 |
| 5 | 4.726 | 4.220 | 57.933 | 4.726 | 4.220 | 57.933 |
| 6 | 4.192 | 3.743 | 61.675 | 4.192 | 3.743 | 61.675 |
| 7 | 3.836 | 3.425 | 65.100 | 3.836 | 3.425 | 65.100 |
| 8 | 3.425 | 3.058 | 68.158 | 3.425 | 3.058 | 68.158 |
| 9 | 2.938 | 2.624 | 70.782 | 2.938 | 2.624 | 70.782 |
| 10 | 2.775 | 2.478 | 73.259 | 2.775 | 2.478 | 73.259 |
| 11 | 2.606 | 2.327 | 75.586 | 2.606 | 2.327 | 75.586 |
| 12 | 1.928 | 1.721 | 77.308 | 1.928 | 1.721 | 77.308 |
| 13 | 1.880 | 1.678 | 78.986 | 1.880 | 1.678 | 78.986 |
| 14 | 1.663 | 1.485 | 80.471 | 1.663 | 1.485 | 80.471 |
| 15 | 1.565 | 1.397 | 81.868 | 1.565 | 1.397 | 81.868 |
| 16 | 1.515 | 1.353 | 83.221 | 1.515 | 1.353 | 83.221 |
| 17 | 1.293 | 1.154 | 84.375 | 1.293 | 1.154 | 84.375 |
| 18 | 1.276 | 1.139 | 85.515 | 1.276 | 1.139 | 85.515 |
| 19 | 1.170 | 1.045 | 86.559 | 1.170 | 1.045 | 86.559 |
| 20 | 1.067 | 0.953 | 87.512 | 1.067 | 0.953 | 87.512 |
| 21 | 1.052 | 0.939 | 88.451 | 1.052 | 0.939 | 88.451 |
| 22 | 0.925 | 0.826 | 89.277 | |||
| 23 | 0.831 | 0.742 | 90.019 | |||
| 24 | 0.786 | 0.702 | 90.721 | |||
| 25 | 0.677 | 0.605 | 91.326 | |||
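The variance columns in the table can be approximately reproduced from the eigenvalues alone. The sketch below assumes the analysis was run on a correlation matrix, so that each percentage is the eigenvalue divided by the number of variables. The value 112 is inferred from the first row (31.128 / 0.27793 ≈ 112) and is not stated in the source. The extraction columns stop at component 21, the last eigenvalue above 1, which is consistent with an eigenvalue-greater-than-one retention rule, although the source does not name the criterion.

```python
# Eigenvalues copied from the "Initial eigenvalue: total" column of the table.
EIGENVALUES = [
    31.128, 12.589, 8.365, 8.075, 4.726, 4.192, 3.836, 3.425, 2.938,
    2.775, 2.606, 1.928, 1.880, 1.663, 1.565, 1.515, 1.293, 1.276,
    1.170, 1.067, 1.052, 0.925, 0.831, 0.786, 0.677,
]

# Assumption: number of input variables, inferred from the table rather than
# stated in the source.
N_VARIABLES = 112


def explained_variance(eigenvalues, n_variables):
    """Return (component, variance %, cumulative %) for each eigenvalue."""
    rows = []
    cumulative = 0.0
    for component, eigenvalue in enumerate(eigenvalues, start=1):
        percent = 100.0 * eigenvalue / n_variables
        cumulative += percent
        rows.append((component, percent, cumulative))
    return rows


def count_retained(eigenvalues, threshold=1.0):
    """Number of components whose eigenvalue exceeds the threshold."""
    return sum(1 for eigenvalue in eigenvalues if eigenvalue > threshold)


if __name__ == "__main__":
    for component, percent, cumulative in explained_variance(EIGENVALUES, N_VARIABLES):
        print(f"{component:2d}  {percent:7.3f}  {cumulative:7.3f}")
    print("components with eigenvalue > 1:", count_retained(EIGENVALUES))
```

Because the printed eigenvalues are rounded to three decimals, the recomputed cumulative percentages can differ from the table in the last digit.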
Fig. 1 Classification of the upstream (red), coding (green), and downstream (blue) regions. The horizontal axis is the value of the first linear discriminant function and the vertical axis is the value of the second linear discriminant function, each computed from the variable values.
The Statistical Result of Discriminant Analysis*
| Result | Group | Predicted group 1 | Predicted group 2 | Predicted group 3 | Predicted group 4 | Predicted group 5 | Total |
|---|---|---|---|---|---|---|---|
| Original | 1 | 71 | 0 | 7 | 8 | 14 | 100 |
| | 2 | 1 | 94 | 0 | 2 | 3 | 100 |
| | 3 | 7 | 0 | 86 | 5 | 2 | 100 |
| | 4 | 4 | 1 | 13 | 69 | 13 | 100 |
| | 5 | 5 | 2 | 12 | 12 | 69 | 100 |
| Cross-validated | 1 | 68 | 4 | 8 | 7 | 13 | 100 |
| | 2 | 1 | 94 | 0 | 2 | 3 | 100 |
| | 3 | 7 | 0 | 86 | 5 | 2 | 100 |
| | 4 | 6 | 2 | 16 | 57 | 19 | 100 |
| | 5 | 9 | 4 | 18 | 13 | 56 | 100 |
“Original” is the classification result for each observed sample, and “Cross-validated” is the result obtained under cross-validation. Groups 1 to 5 represent the upstream, exon, intron, downstream, and intergenic regions, respectively. Under “Predicted group membership”, the fitted discriminant function reclassifies the source data, and each predicted group is compared with the sample's actual group to compute the probability of misclassification. For example, for the 100 samples of group 1, the discriminant function built from the original data assigns 71, 0, 7, 8, and 14 samples to groups 1, 2, 3, 4, and 5, respectively.
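The counts in the table can be summarized as per-group and overall rates of correct classification. The sketch below copies both confusion matrices from the table (rows are actual groups, columns are predicted groups). The function names are illustrative, and the derived rates are simple arithmetic on the table, not figures reported by the authors.

```python
GROUPS = ["upstream", "exon", "intron", "downstream", "intergenic"]

# Rows: actual group 1..5; columns: predicted group 1..5 (from the table).
ORIGINAL = [
    [71, 0, 7, 8, 14],
    [1, 94, 0, 2, 3],
    [7, 0, 86, 5, 2],
    [4, 1, 13, 69, 13],
    [5, 2, 12, 12, 69],
]

CROSS_VALIDATED = [
    [68, 4, 8, 7, 13],
    [1, 94, 0, 2, 3],
    [7, 0, 86, 5, 2],
    [6, 2, 16, 57, 19],
    [9, 4, 18, 13, 56],
]


def per_group_rate(matrix):
    """Fraction of each actual group that was assigned to the same group."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def overall_rate(matrix):
    """Fraction of all samples on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


if __name__ == "__main__":
    for name, matrix in (("original", ORIGINAL), ("cross-validated", CROSS_VALIDATED)):
        rates = ", ".join(
            f"{group}: {rate:.2f}" for group, rate in zip(GROUPS, per_group_rate(matrix))
        )
        print(f"{name}: overall {overall_rate(matrix):.3f} ({rates})")
```

On these counts the overall rate is 389/500 for the original data and 361/500 under cross-validation, with exons and introns classified most reliably in both cases.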