| Literature DB >> 22102890 |
Sequence-based classification using discriminatory motif feature selection
Hao Xiong, Daniel Capurso, Saunak Sen, Mark R Segal.
Abstract
Most existing methods for sequence-based classification use exhaustive feature generation, employing, for example, all k-mer patterns. The motivation behind such (enumerative) approaches is to minimize the potential for overlooking important features. However, there are shortcomings to this strategy. First, practical constraints limit the scope of exhaustive feature generation to patterns of length ≤ k, such that potentially important, longer (> k) predictors are not considered. Second, features so generated exhibit strong dependencies, which can complicate understanding of derived classification rules. Third, and most importantly, numerous irrelevant features are created. These concerns can compromise prediction and interpretation. While remedies have been proposed, they tend to be problem-specific and not broadly applicable. Here, we develop a generally applicable methodology, and an attendant software pipeline, that is predicated on discriminatory motif finding. In addition to the traditional training and validation partitions, our framework entails a third level of data partitioning, a discovery partition. A discriminatory motif finder is used on sequences and associated class labels in the discovery partition to yield a (small) set of features. These features are then used as inputs to a classifier in the training partition. Finally, performance assessment occurs on the validation partition. Important attributes of our approach are its modularity (any discriminatory motif finder and any classifier can be deployed) and its universality (all data, including sequences that are unaligned and/or of unequal length, can be accommodated). We illustrate our approach on two nucleosome occupancy datasets and a protein solubility dataset, previously analyzed using enumerative feature generation. Our method achieves excellent performance results, with and without optimization of classifier tuning parameters. A Python pipeline implementing the approach is available at http://www.epibiostat.ucsf.edu/biostat/sen/dmfs/.
Year: 2011 PMID: 22102890 PMCID: PMC3213122 DOI: 10.1371/journal.pone.0027382
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Illustrative diagram of data flow through the pipeline.
Data is initially partitioned into discovery and classification sets. The classification set is further partitioned into training and validation sets. After WordSpy elicits motifs using the discovery set, fuzznuc or fuzzpro counts corresponding motif occurrences in the remaining data. The training data counts are used to train a classifier, while the validation data counts are used to determine performance (e.g. AUC) of the learned classifier.
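To make the data flow concrete, here is a minimal Python sketch of the three-level partitioning, under stated assumptions: `find_discriminatory_motifs` and `count_motifs` are naive stand-ins for WordSpy and fuzznuc/fuzzpro, the split proportions and classifier are illustrative choices from scikit-learn, and none of this is the published pipeline's actual code.

```python
# Minimal sketch of the DMFS data flow. find_discriminatory_motifs and
# count_motifs are naive stand-ins for WordSpy and fuzznuc/fuzzpro; the
# classifier and split proportions are illustrative, not the paper's code.
from collections import Counter

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def kmers(seq, k):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def find_discriminatory_motifs(seqs, labels, k=4, n_top=20):
    # Stand-in for a discriminatory motif finder (the pipeline uses WordSpy):
    # rank k-mers by the gap between the fraction of positive and negative
    # sequences that contain them.
    pos = Counter(m for s, y in zip(seqs, labels) if y == 1 for m in set(kmers(s, k)))
    neg = Counter(m for s, y in zip(seqs, labels) if y == 0 for m in set(kmers(s, k)))
    n_pos = max(sum(labels), 1)
    n_neg = max(len(labels) - sum(labels), 1)
    score = {m: abs(pos[m] / n_pos - neg[m] / n_neg) for m in set(pos) | set(neg)}
    return sorted(score, key=score.get, reverse=True)[:n_top]


def count_motifs(seqs, motifs):
    # Plain substring counts; fuzznuc/fuzzpro support far richer pattern syntax.
    return np.array([[s.count(m) for m in motifs] for s in seqs])


def dmfs_run(seqs, labels, seed=0):
    # Level 1: set aside a discovery partition for motif finding.
    s_disc, s_rest, y_disc, y_rest = train_test_split(
        seqs, labels, train_size=0.2, random_state=seed, stratify=labels)
    # Level 2: split the remainder into training and validation partitions.
    s_tr, s_val, y_tr, y_val = train_test_split(
        s_rest, y_rest, test_size=0.5, random_state=seed, stratify=y_rest)
    motifs = find_discriminatory_motifs(s_disc, y_disc)
    clf = RandomForestClassifier(random_state=seed)
    clf.fit(count_motifs(s_tr, motifs), y_tr)
    prob = clf.predict_proba(count_motifs(s_val, motifs))[:, 1]
    return roc_auc_score(y_val, prob)
```

Because the pipeline is modular, the random forest can be swapped for any classifier (e.g. `sklearn.svm.SVC(probability=True)` for the SVM results below) without touching the motif-finding stage.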
Table 1. Nucleosome occupancy data.
| Dataset | DMFS Default (SVM) | DMFS Default (RF) | DMFS Tuned (SVM) | DMFS Tuned (RF) | Reported (SVM) | Enumerative (SVM) | Enumerative (RF) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Dennis | 0.908 | 0.902 | 0.91 | 0.905 | 0.908 | 0.92 | 0.918 |
| Ozsolak | 0.766 | 0.764 | 0.78 | 0.768 | 0.737 | 0.8 | 0.79 |
Mean AUCs for the nucleosome occupancy datasets and approaches as described in the text. Reported values are from Gupta et al. [4]. The DMFS pipeline results are stable with small standard deviations as determined by 40 runs with random data partitioning: Dennis data with (a) default parameter settings: 0.0055 (SVM) and 0.0036 (RF), and (b) tuned parameter settings: 0.0048 (SVM) and 0.0041 (RF); Ozsolak data with (a) default parameter settings: 0.0084 (SVM) and 0.0078 (RF), and (b) tuned parameter settings: 0.011 (SVM) and 0.0086 (RF).
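The stability figures above come from repeating the entire pipeline on fresh random partitions. Reusing the hypothetical `dmfs_run` sketch from Figure 1, that loop is simply:

```python
# Repeat the full pipeline over 40 random partitions and summarize stability,
# mirroring the 40-run mean AUCs and standard deviations reported above.
aucs = np.array([dmfs_run(seqs, labels, seed=s) for s in range(40)])
print(f"mean AUC = {aucs.mean():.3f}, sd = {aucs.std(ddof=1):.4f}")
```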
Figure 2. ROC curves from DMFS and enumerative methods for the nucleosome occupancy datasets.
The red and green curves are from Gupta et al. [4] for the Dennis and Ozsolak data, respectively. The black and blue curves are from the DMFS method for the Dennis and Ozsolak data, respectively. For both datasets, the DMFS ROC curve is approximately equal to the ROC curve obtained using enumerative feature generation. This figure was created by manipulating Figure 1 of Gupta et al. [4] in GIMP. The DMFS ROC curves are relatively stable: as the false positive rate ranges from 10% to 90%, the true positive rate standard deviations range from … to … for the Dennis data and from … to … for the Ozsolak data.
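As a hedged sketch, an ROC curve like those in Figure 2 can be traced from the validation-partition scores, where `y_val` and `prob` are assumed to be the validation labels and predicted probabilities from a run such as the `dmfs_run` sketch above:

```python
# Sketch: trace an ROC curve from validation-set scores with scikit-learn.
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

fpr, tpr, _ = roc_curve(y_val, prob)  # y_val, prob from a dmfs_run-style run
plt.plot(fpr, tpr, label=f"DMFS (AUC = {auc(fpr, tpr):.3f})")
plt.plot([0, 1], [0, 1], "k--", label="chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```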
Table 2. DMFS pipeline recovery of previously identified motifs.
| Reported motif | Dennis | Ozsolak |
| --- | --- | --- |
| … | 37 | 36 |
| … | 26 | 28 |
| … | 7 | 6 |
| … | 5 | 4 |
| … | 37 | 38 |
| … | 2 | 8 |
| … | 1 | 1 |
| … | 6 | 19 |
| … | 2 | 6 |
| … | 37 | 34 |
| … | 28 | 22 |
| … | 4 | 0 |
| … | 32 | 39 |
| … | 9 | 1 |
| … | 13 | 4 |
| … | 4 | 8 |
| … | 40 | 40 |
| … | 10 | 4 |
Here we list motifs identified by Tillo and Hughes [48] and Lee et al. [49], together with the number of times each motif was identified by the DMFS pipeline. Structure-related features are omitted, as are transcription factor binding sites and features with zero weights. We ran the DMFS pipeline 40 times, with random data partitioning, and counted the number of times each previously identified motif occurred. According to Tillo and Hughes, the most discriminative motif is the 4-mer AAAA/TTTT, which emerged in almost every run.
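Tallying these recovery counts reduces to matching each reported motif, on either strand, against the motifs each run discovers. A minimal sketch, where `motif_lists` is a hypothetical variable holding one list of discovered motif strings per run:

```python
# Count how many of the 40 runs recovered a previously reported motif.
# motif_lists is assumed: one list of discovered motif strings per run.
# A motif counts as recovered on either strand, hence the reverse complement.
COMP = str.maketrans("ACGT", "TGCA")


def revcomp(motif):
    return motif.translate(COMP)[::-1]


def recovery_count(reported, motif_lists):
    return sum(
        any(reported in m or revcomp(reported) in m for m in found)
        for found in motif_lists)

# e.g. recovery_count("AAAA", motif_lists) would return 40 if the AAAA/TTTT
# motif emerged in every run, as reported for Tillo and Hughes'
# most discriminative 4-mer.
```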
Table 3. Protein solubility data.
| Approach | Method | Accuracy |
| --- | --- | --- |
| DMFS Default | SVM | 0.62 |
| DMFS Default | RF | 0.61 |
| DMFS Tuned | SVM | 0.63 |
| DMFS Tuned | RF | 0.645 |
| Enumerative | SVM (…) | 0.63 |
| Enumerative | RF (…) | 0.64 |
| Reported | 1-mer (SVM) | 0.644 |
| Reported | 2-mer (SVM) | 0.597 |
| Reported | 3-mer (SVM) | 0.548 |
Protein solubility data accuracies for default and tuned parameter settings, as well as for reported and enumerative methods. Reported values are from Magnan et al. [5]. The DMFS pipeline results are stable with small standard deviations as determined by 20 runs with random data partitioning: (a) default parameter settings: 0.006 (SVM) and 0.0048 (RF); and (b) tuned parameter settings: 0.0052 (SVM) and 0.0049 (RF).