Cornelia Caragea, Adrian Silvescu, Prasenjit Mitra.
Abstract
Recent advances in next-generation sequencing technologies have resulted in an exponential increase in the rate at which protein sequence data are being acquired. The k-gram feature representation, commonly used for protein sequence classification, usually results in prohibitively high dimensional input spaces for large values of k. Applying data mining algorithms to these input spaces may be intractable due to the large number of dimensions. Hence, using dimensionality reduction techniques can be crucial for the performance and the complexity of the learning algorithms. In this paper, we study the applicability of feature hashing to protein sequence classification, where the original high-dimensional space is "reduced" by hashing the features into a low-dimensional space, using a hash function, i.e., by mapping features into hash keys, where multiple features can be mapped (at random) to the same hash key, and "aggregating" their counts. We compare feature hashing with the "bag of k-grams" approach. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks.
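The hashing step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `kgrams` and `hash_features`, the parameter `n_bits`, and the use of `zlib.crc32` as the hash function are all assumptions made for illustration, since the abstract does not say which hash function was used.

```python
from collections import Counter
import zlib


def kgrams(sequence, k_min=1, k_max=4):
    """Yield every contiguous k-gram of `sequence` for k_min <= k <= k_max."""
    for k in range(k_min, k_max + 1):
        for i in range(len(sequence) - k + 1):
            yield sequence[i:i + k]


def hash_features(sequence, n_bits=22, k_min=1, k_max=4):
    """Map k-gram counts into 2**n_bits bins.

    K-grams that share a hash key have their counts summed, which is the
    "aggregating" step from the abstract. CRC32 is an illustrative choice
    of hash function, not the one reported in the paper.
    """
    n_bins = 1 << n_bits
    hashed = Counter()
    for gram in kgrams(sequence, k_min, k_max):
        key = zlib.crc32(gram.encode("ascii")) % n_bins
        hashed[key] += 1
    return hashed
```

Calling `hash_features("MKTAYIAK", n_bits=14)` returns a sparse count vector whose keys lie in the range 0 to 2^14 - 1. The total count is the same for any `n_bits`, because collisions merge counts rather than discard them.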
Year: 2012 PMID: 22759572 PMCID: PMC3380737 DOI: 10.1186/1477-5956-10-S1-S14
Source DB: PubMed Journal: Proteome Sci ISSN: 1477-5956 Impact factor: 2.480
Figure 1. Feature hashing on sparse high-dimensional feature spaces. Feature hashing is performed to reduce very high dimensions to mid-size dimensions, which does not significantly distort the data.
Figure 2. The feature hashing representations. The transformation of "bag of k-grams" into the feature hashing representations.
Figure 3. The distribution of the variable length k-grams. The variable length k-grams in each protein data set, (a) non-plant, (b) plant, and (c) psortNeg, follow a Zipf distribution, i.e., only very few k-grams occur with high frequency, whereas the majority of them occur very rarely.
Comparison of fixed-length with variable-length k-gram representations.
| Bag of fixed or variable length k-grams | Accuracy % (non-plant) | # features (non-plant) |
|---|---|---|
| 1-grams | 71.21 | 20 |
| 2-grams | 70.85 | 400 |
| 3-grams | 79.80 | 7999 |
| 4-grams | 79.03 | 146598 |
| (1-2)-grams | 70.56 | 420 |
| (1-3)-grams | 79.69 | 8419 |
| (1-4)-grams | | |
| (1-5)-grams | 80.09 | 950849 |
The performance of SVM classifiers trained using feature hashing on fixed length 1-, 2-, 3-, and 4-gram representations, as well as variable length (1-2)-, (1-3)-, (1-4)-, and (1-5)-gram representations, where the hash size is set to 2^22, on the non-plant data set.
Figure 4. Feature hashing vs. "bag of k-grams". Comparison of feature hashing with the "bag of variable length k-grams" approach, referred to as the baseline, on the protein data sets (a) non-plant, (b) plant, and (c) psortNeg, respectively, using (1-4)-gram representations.
The number of variable-length k-grams and the rate of hash collisions for various hash sizes.
| Hash size | non-plant: # features | non-plant: Collisions % | plant: # features | plant: Collisions % | psortNeg: # features | psortNeg: Collisions % |
|---|---|---|---|---|---|---|
| 2^22 | 155017 | 0 | 111544 | 0 | 124389 | 0 |
| 2^20 | 153166 | 1.21 | 110236 | 1.18 | 122894 | 1.22 |
| 2^19 | 147223 | 5.29 | 107299 | 3.95 | 118871 | 4.64 |
| 2^18 | 132754 | 16.30 | 99913 | 11.43 | 109535 | 13.22 |
| 2^17 | 99764 | 45.04 | 82141 | 31.38 | 87618 | 35.66 |
| 2^16 | 59358 | 78.53 | 53616 | 64.29 | 55555 | 68.85 |
| 2^15 | 32474 | 95.80 | 31788 | 89.56 | 32075 | 92.02 |
| 2^14 | 16384 | 100 | 16384 | 100 | 16384 | 100 |
The number of unique features (denoted as # features) and the rate of collisions on non-plant, plant, and psortNeg data sets, respectively, for variable length k-gram representations, where k varies from 1 to 4.
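The two quantities reported per hash size can be sketched in Python as follows. The caption does not define the collision rate, so this sketch assumes one plausible reading: the share of occupied hash keys that hold more than one distinct k-gram. That reading agrees with the boundary rows of the table (0 when no k-grams share a key, 100 when the 2^14 = 16384 keys are all shared), but it remains an assumption. The function name `collision_stats` and the use of `zlib.crc32` are likewise illustrative, not taken from the paper.

```python
import zlib


def collision_stats(features, n_bits):
    """Return (occupied_bins, collision_pct) after hashing into 2**n_bits bins.

    `occupied_bins` counts the distinct hash keys in use, playing the role of
    "# features" after hashing. `collision_pct` is the percentage of those
    keys shared by at least two distinct input features (an assumed
    definition; the source does not spell one out).
    """
    n_bins = 1 << n_bits
    bins = {}
    for feature in set(features):  # count each distinct k-gram once
        key = zlib.crc32(feature.encode("ascii")) % n_bins
        bins[key] = bins.get(key, 0) + 1
    occupied = len(bins)
    if occupied == 0:
        return 0, 0.0
    collided = sum(1 for count in bins.values() if count > 1)
    return occupied, 100.0 * collided / occupied
```

Running `collision_stats` over the set of distinct k-grams in a data set for each value of `n_bits` would produce one row of a table like the one above. The exact numbers depend on the hash function, so CRC32 is not expected to reproduce the reported values.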