Andrew T Magis, Nathan D Price.
Abstract
BACKGROUND: Relative expression algorithms such as the top-scoring pair (TSP) and the top-scoring triplet (TST) have several strengths that distinguish them from other classification methods, including resistance to overfitting, invariance to most data normalization methods, and biological interpretability. The top-scoring 'N' (TSN) algorithm is a generalized form of other relative expression algorithms that uses generic permutations and a dynamic classifier size to control both the permutation and combination space available for classification.
Year: 2012 PMID: 22966958 PMCID: PMC3663421 DOI: 10.1186/1471-2105-13-227
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
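The relative-expression principle described in the abstract, classifying a sample purely by the ordering of a few genes' expression values, can be illustrated for the pair case (TSP). This is a minimal sketch, not the authors' implementation; the names `tsp_score` and `best_pair` and the toy data are our own. Because only the ordering of the two values matters, the rule is unchanged by any monotone normalization of the data.

```python
def tsp_score(class_a, class_b, i, j):
    """Score a gene pair (i, j) as |P(x_i < x_j | A) - P(x_i < x_j | B)|.

    class_a and class_b are lists of expression profiles (lists of floats).
    A score near 1 means the ordering of genes i and j flips between classes.
    """
    p_a = sum(s[i] < s[j] for s in class_a) / len(class_a)
    p_b = sum(s[i] < s[j] for s in class_b) / len(class_b)
    return abs(p_a - p_b)

def best_pair(class_a, class_b, n_genes):
    """Exhaustively search all ordered gene pairs for the top-scoring one."""
    return max(((i, j) for i in range(n_genes) for j in range(n_genes) if i != j),
               key=lambda ij: tsp_score(class_a, class_b, *ij))

# Toy data: gene 0 lies below gene 1 in class A and above it in class B,
# so the pair (0, 1) achieves the maximum score of 1.0.
A = [[1.0, 2.0, 5.0], [0.5, 1.5, 4.0]]
B = [[3.0, 1.0, 5.0], [2.5, 0.5, 4.5]]
i, j = best_pair(A, B, 3)
```

A new sample would then be assigned to class A or B according to whether its gene-`i` value is below its gene-`j` value; TSN generalizes this from a pair ordering to a permutation of N genes.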
Figure 1. The Lehmer code. A complete translation from permutation to decimal, by way of the factoradic, for a permutation of size 4. Each permutation is mapped to a single unique decimal representation. Two additional translations from permutation to factoradic are shown in Additional file 1: Figure S2.
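The permutation-to-factoradic-to-decimal translation in Figure 1 can be sketched as follows (an illustrative implementation; the function names are ours). Digit k of the Lehmer code counts how many elements to the right of position k are smaller, and the code is then read as a factoradic number with digit k weighted by (n−1−k)!.

```python
from math import factorial

def lehmer_code(perm):
    """Lehmer code of a permutation: digit k counts the elements to the
    right of position k that are smaller than perm[k]."""
    n = len(perm)
    return [sum(perm[k] > perm[m] for m in range(k + 1, n)) for k in range(n)]

def lehmer_to_decimal(code):
    """Read a Lehmer code as a factoradic number: digit k has weight (n-1-k)!."""
    n = len(code)
    return sum(d * factorial(n - 1 - k) for k, d in enumerate(code))

# The size-4 permutation [3, 4, 1, 2] has Lehmer code [2, 2, 0, 0],
# i.e. decimal rank 2*3! + 2*2! + 0*1! + 0*0! = 16.
code = lehmer_code([3, 4, 1, 2])
rank = lehmer_to_decimal(code)
```

Since the factoradic digits range over 0..(n−1−k), every permutation of size n maps to a unique integer in 0..(n!−1), which is what lets TSN index its permutation space compactly.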
Figure 2. Inversions. (Top) Four inversions are required to translate the sorted list [1 2 3 4] into the permutation [3 4 1 2]. The sum of the digits of the factoradic gives the number of inversions required to translate one permutation into another. (Bottom) The grey squares indicate the set of permutations at an inversion distance of one from the original (black) permutations.
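The caption's claim that the factoradic digit sum equals the inversion count can be checked directly with a small sketch (our own helper, not the paper's code): counting the out-of-order pairs of [3 4 1 2] gives four, matching the digit sum of its Lehmer code [2, 2, 0, 0].

```python
def inversions(perm):
    """Count inversions: pairs (k, m) with k < m and perm[k] > perm[m]."""
    n = len(perm)
    return sum(perm[k] > perm[m] for k in range(n) for m in range(k + 1, n))

# Four inversions separate [3, 4, 1, 2] from the sorted list [1, 2, 3, 4],
# equal to the digit sum of its Lehmer code [2, 2, 0, 0].
count = inversions([3, 4, 1, 2])
```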
Figure 3. GPU vs. CPU running times. Running times for N = 2, N = 3, and N = 4 over a range of input feature sizes. Each point is the mean of three independent runs of the software. The CPU running time for N = 2 over 20,000 features is similar to the running times for N = 3 over 1000 features and N = 4 over 200 features. The CPU version of TSN was run on a single core of a 2.4 GHz Intel Core 2 processor. The GPU version of TSN was run on an NVIDIA Tesla C2050. The speedup due to the GPU improves as N increases: 2.3X for N = 2, 2.8X for N = 3, and 4.4X for N = 4. Running times reflect a single iteration of the algorithm and do not include repeated iterations such as cross validation. Note that running times are also a function of the number of samples in the dataset; there were 70 samples in this dataset.
Figure 4. Results of TSN classification on cancer datasets. Results of 100 rounds of 5-fold cross validation over a range of N = {2, 3, 4}, where the number of differentially expressed probes is {16, 10, 9}, respectively. This yields approximately the same number of possible combinations for each value of N (~120), illustrating how classification accuracy can be determined by the permutation itself, not just the number of combinations available. Results shown include accuracies for fixed values of N as well as for the dynamic-N algorithm described in the methods section. Statistical differences were calculated using the nonparametric Kruskal-Wallis one-way analysis of variance by ranks, and a p-value < 0.05 was considered significant. Bars sharing the same letter are not statistically different. The datasets are derived from [2] and represent a wide range of cancers. Significance plots for all nine cancer datasets are in Additional file 1: Figure S4.
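The evaluation protocol above (repeated rounds of 5-fold cross validation) can be sketched with a stdlib-only index generator. This is a hypothetical helper for illustration, not the authors' pipeline; the name `k_fold_indices` and the seeding scheme are our own assumptions.

```python
import random

def k_fold_indices(n_samples, k=5, rounds=100, seed=0):
    """Yield (train, test) index lists for `rounds` independently shuffled
    runs of k-fold cross validation. Each round reshuffles the samples and
    partitions them into k disjoint test folds."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    for _ in range(rounds):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        folds = [idx[f::k] for f in range(k)]  # k near-equal disjoint folds
        for f in range(k):
            test = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield train, test
```

Each round produces k train/test splits whose test folds partition the sample set, so 100 rounds of 5-fold cross validation yield 500 model fits per configuration.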
The five MAQC-II datasets, representing endpoints A through I, that are available from the Gene Expression Omnibus (a blank platform cell indicates the same platform as the preceding endpoint)
| Endpoint | Description | Platform |
| A | Lung tumorigen | Affymetrix Mouse 430 2.0 |
| B | Non-genotoxic liver carcinogens | Amersham Uniset Rat 1 Bioarray |
| C | Liver toxicants | Affymetrix Rat 230 2.0 |
| D | Pre-operative treatment response | Affymetrix Human U133A |
| E | Estrogen receptor status | |
| F | Overall survival milestone outcome | Affymetrix Human U133 Plus 2.0 |
| G | Event-free survival milestone outcome | |
| H | Gender of patient (positive control) | |
| I | Random class labels (negative control) | |
The participants who submitted models for every endpoint (original and swap) in the MAQC-II study, and the classification methods they used
| Group | Classification methods |
| Chinese Academy of Sciences | Naïve Bayes, Support Vector Machine |
| CapitalBio Corporation, China | k-Nearest Neighbor, Support Vector Machine |
| Weill Medical College of Cornell University | Support Vector Machine |
| Fondazione Bruno Kessler, Italy | Discriminant Analysis, Support Vector Machine |
| GeneGo, Inc. | Discriminant Analysis, Random Forest |
| Golden Helix, Inc. | Classification Tree |
| GlaxoSmithKline | Naïve Bayes |
| National Center for Toxicological Research, FDA | k-Nearest Neighbor, Naïve Bayes, Support Vector Machine |
| Northwestern University | k-Nearest Neighbor, Classification Tree, Support Vector Machine |
| Systems Analytics, Inc. | Discriminant Analysis, k-Nearest Neighbor, Machine Learning, Support Vector Machine, Logistic Regression |
| SAS Institute, Inc. | Classification Tree, Discriminant Analysis, Logistic Regression, Partial Least Squares, Support Vector Machine |
| Tsinghua University, China | Classification Tree, k-Nearest Neighbor, Recursive Feature Elimination, Support Vector Machine |
| University of Illinois, Urbana-Champaign | Classification Tree, k-Nearest Neighbor, Naïve Bayes, Support Vector Machine |
| University of Southern Mississippi | Artificial Neural Network, Naïve Bayes, Sequential Minimal Optimization, Support Vector Machine |
| Zhejiang University, China | k-Nearest Neighbor, Nearest Centroid |
Figure 5. Results of TSN classification on MAQC-II datasets. MCC of MAQC-II endpoints A through I, based on models learned on the training set and then applied to the validation set. MCC values range from +1 (perfect prediction) to −1 (perfect inverse prediction), with 0 indicating random prediction. Boxplots show the MCC distribution of the models from the 15 groups, including TSN, that predicted all original and swap endpoints from the MAQC-II. The original and swap MCC values are averaged for each group. In addition to endpoints A through I, a boxplot showing the mean MCC over endpoints A through H is shown (ALL). We exclude endpoint I from this final boxplot because it is a negative control. The bottom and top of each box indicate the lower and upper quartiles of the data, respectively. The middle line represents the median. The whiskers indicate the extreme values. The asterisk represents the performance of TSN on that dataset. All raw data are included in Additional file 3.
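The Matthews correlation coefficient used in Figure 5 is a standard summary of a 2x2 confusion matrix; a minimal sketch (our own helper, following the usual definition) shows why it spans +1 for perfect prediction through 0 for random prediction to −1 for perfect inverse prediction.

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from a 2x2 confusion matrix:
    (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    Returns 0.0 when any marginal is zero (the usual convention)."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Perfect prediction -> +1; perfect inverse prediction -> -1.
assert mcc(10, 10, 0, 0) == 1.0
assert mcc(0, 0, 10, 10) == -1.0
```

Unlike raw accuracy, MCC stays near 0 for a trivial majority-class predictor on imbalanced endpoints, which is why the MAQC-II consortium reports it.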
Figure 6. ΔMCC results from MAQC-II data. Boxplots showing the distribution of ΔMCC values on the original data for each group, where ΔMCC = Cross Validation MCC − Validation Set MCC. This illustrates the amount of overfitting present during cross validation. The absolute value of each ΔMCC value was used in the calculations. The cross validation performed for TSN was 5-fold cross validation, repeated 10 times, as recommended by the MAQC-II consortium. Boxplots are sorted by the mean ΔMCC for each group (asterisk). All raw data are included in Additional file 3.