Literature DB >> 35835556
Bálint Király, Balázs Hangya.
Abstract
Model selection is often implicit: when performing an ANOVA, one assumes that the normal distribution is a good model of the data; fitting a tuning curve implies that an additive and a multiplicative scalar describe the behavior of the neuron; even calculating an average implicitly assumes that the data were sampled from a distribution that has a finite first statistical moment: the mean. Model selection may also be explicit, when the aim is to test whether one model provides a better description of the data than a competing one. As a special case, clustering algorithms identify groups with similar properties within the data. They are widely used from spike sorting to cell type identification to gene expression analysis. We discuss model selection and clustering techniques from a statistician's point of view, revealing the assumptions behind them and the logic that governs the various approaches. We also showcase important neuroscience applications and provide suggestions on how neuroscientists could put model selection algorithms to best use, as well as what mistakes should be avoided.
Keywords: Bayes; bootstrap; clustering; cross-validation; information criterion; resampling
Year: 2022 PMID: 35835556 PMCID: PMC9282170 DOI: 10.1523/ENEURO.0066-22.2022
Source DB: PubMed Journal: eNeuro ISSN: 2373-2822
List of common “mines” of model selection and clustering discussed in the paper
| Mine | Issue | Suggestion | Example |
|---|---|---|---|
| Mine #1 | Selecting models without noticing it | Be aware of the assumptions behind analysis methods; treat the choice among different algorithms as a model selection problem | |
| Mine #2 | Overfitting with overly complex models | Use statistical model selection tools that penalize too many parameters | Polynomial fitting (see the sketch after this table) |
| Mine #3 | Selecting from a pool of poorly fitting models might lead to false confidence | Simulate data from each of the tested models multiple times and test whether the real data are sufficient to distinguish among the competing models | |
| Mine #4 | Different information criteria might favor different models | Consider the strengths and limitations of the different approaches | |
| Mine #5 | Model selection might be sensitive to parameters ignored by the tested models | Avoid model classes that are too restrictive to account for data heterogeneity | |
| Mine #6 | Cross-validation techniques are prone to overfitting | A data-splitting approach was proposed by Genkin and Engel in which optimal model complexity is determined by calculating the KL divergence | |
| Mine #7 | Agglomerative hierarchical clustering is sensitive to outliers | Consider divisive methods | |
| Mine #8 | K-means clustering might converge to local minima | Repeat several times from different starting centroid locations | |
| Mine #9 | Number of clusters not known | Use the elbow method, gap statistics, or model selection approaches | |
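Mine #2 can be made concrete with a short sketch (our illustration, not from the paper; it assumes only NumPy, and the quadratic "true" model and noise level are invented for the example): the residual error keeps shrinking as the polynomial order grows, while an AIC-style penalty counteracts the unnecessary parameters.

```python
# Hypothetical sketch of Mine #2: AIC penalizes overly complex polynomial fits.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 40)
y = 2.0 * x**2 - 0.5 * x + rng.normal(0, 0.2, x.size)  # true model: order-2 polynomial + noise

for order in range(1, 8):
    coeffs = np.polyfit(x, y, order)                 # least-squares polynomial fit
    rss = np.sum((np.polyval(coeffs, x) - y) ** 2)   # residual sum of squares
    k = order + 2                                    # polynomial coefficients + noise variance
    aic = x.size * np.log(rss / x.size) + 2 * k      # Gaussian-likelihood AIC, up to a constant
    print(f"order {order}: RSS = {rss:.3f}, AIC = {aic:.1f}")
```

The raw residual error decreases monotonically with the polynomial order, whereas the AIC typically reaches its minimum near the true order of two, illustrating why a penalized criterion is preferable to the fit error alone.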
Figure 1. Examples of model selection problems in neuroscience.
A, Using MAICE (minimum AIC estimation) to choose between competing models. Left, We used a bell curve to simulate neural responses as a function of stimulus features, generally referred to as a "tuning curve" (gray), and used a multiplicative gain model (y = a·x(s) + ε, where x(s) is the baseline tuning curve, a is a scalar, and ε is a Gaussian noise term) to simulate a tuning curve change (black crosses). Next, an additive (light green) and a multiplicative model (dark green) were fitted to the simulated data. The smaller AIC value indicated that the multiplicative model fitted better (inset), as expected based on the simulation. Right, We performed n = 100 data simulations and calculated the difference between the AIC values of the competing fitted models. The histogram (and the mean values in the inset) demonstrates that the multiplicative model outperformed the additive one in every case. Error bars show the standard deviation from the mean; **p < 0.01; two-sided bootstrap test.
B, A gain model with both additive and multiplicative components (y = a·x(s) + b + ε, where x(s) is the baseline tuning curve, a and b are scalars, and ε is a Gaussian noise term) was used to simulate a tuning curve change (black crosses) relative to a baseline tuning curve (gray). Next, an additive (light green) and a multiplicative model (dark green) were fitted to the simulated data. While the best-fit curves visibly deviated from the simulated tuning curve, the smaller AIC value indicated that the additive model fitted somewhat better.
C, Using BIC to choose the best-fitting ARMA model. Left, An ARMA(p = 2, q = 1) process was used to simulate LFP time series data (gray), where p denotes the order of the autoregressive (AR) and q the order of the moving average component. Next, we fitted ARMA models to the data with different p and q values in the range 1–4. The blue trace shows the predicted data based on the best-fitting p = 2, q = 1 model. Middle, BIC was calculated for each model, and the model with the lowest value (p = 2, q = 1) was chosen. Top, The ARMA(p, q) model, where ϕ_i are the AR parameters, θ_i are the moving average parameters, ε_i are Gaussian noise terms, and μ is a constant. Right, AIC was calculated for each model. AIC favored the more complex p = 2, q = 4 model over the expected p = 2, q = 1 model.
D, Demonstration of the use of information criteria and the parametric bootstrap technique for choosing the number of modes in a distribution. We simulated phase preference data of neuronal firing (n = 250) referenced to an LFP oscillation as the combination of three wrapped normal distributions (top left; D = π/2 refers to the phase difference between the means of the two closest wrapped normal distributions). Next, mixture models of 1–4 von Mises distributions (the circular analog of the normal distribution, which closely approximates the wrapped normal distribution) were fitted to the distributions (right) using an expectation-maximization algorithm for circular data (Czurkó et al., 2011). The minimal AIC and BIC, as well as the maximal bootstrap p value, correctly suggest that the three-mode model is the best-fitting one. BIC penalizes the higher-mode models more than AIC does (bottom left).
E, Same as panel D (left), but with a phase difference parameter of D = 0.85·π/2. While the minimal AIC and the maximal bootstrap p value still suggest that the three-mode model is the best-fitting one, BIC favored a simpler model with two modes.
F, Information criterion difference between the two-mode and three-mode models as a function of the phase difference parameter D (left, n = 2000) and the sample size (n, at D = 0.85·π/2). Color coding of the y-axis reflects the favored model (orange: two modes; red: three modes). If the modes are well separated and the sample size is sufficient, both AIC and BIC choose the correct three-mode model, while both can fail for low sample sizes or less distinguishable modes. At parameters where the information criterion differences are close to zero, BIC might favor the two-mode model while AIC might correctly identify three modes.
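The comparison in panel A, left, can be sketched as follows (our own illustration, not the authors' code; the bell-shaped baseline tuning curve, the gain factor a = 1.8, and the noise level are invented for the example). Both competing models are fitted by least squares and compared by a Gaussian-likelihood AIC.

```python
# Sketch of a Figure 1A-style model comparison with hypothetical parameters.
import numpy as np

rng = np.random.default_rng(1)
s = np.linspace(-3, 3, 30)                  # stimulus feature
x = np.exp(-s**2 / 2)                       # baseline tuning curve x(s), a bell curve
y = 1.8 * x + rng.normal(0, 0.05, s.size)   # simulated change: multiplicative gain a = 1.8

def aic(rss, n, k):
    """Gaussian-likelihood AIC up to an additive constant."""
    return n * np.log(rss / n) + 2 * k

# Multiplicative model y = a * x(s): closed-form least-squares estimate of a
a_hat = np.sum(x * y) / np.sum(x * x)
rss_mult = np.sum((y - a_hat * x) ** 2)

# Additive model y = x(s) + b: closed-form least-squares estimate of b
b_hat = np.mean(y - x)
rss_add = np.sum((y - (x + b_hat)) ** 2)

print("AIC, multiplicative model:", aic(rss_mult, s.size, k=2))  # gain + noise variance
print("AIC, additive model:      ", aic(rss_add, s.size, k=2))   # offset + noise variance
```

With these settings the multiplicative model should yield the lower AIC, mirroring the simulation in the figure; repeating the simulation many times, as in the right part of panel A, tests how reliable that preference is.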
Advantages and limitations of model selection and clustering algorithms
| Method | Advantages | Limitations | Suggestions |
|---|---|---|---|
| **Statistical model selection** | | | |
| Akaike information criterion (AIC) | Strong mathematical basis (KL divergence) | May lead to false confidence in marginally better models | If critical, perform simulations to ascertain true differences among the tested models |
| | Easy to calculate | Difficult to test whether differences are significant | |
| | Suitable for comparing models of different complexity | Not suitable for low sample sizes | Consider AICc if its assumptions are met |
| Bayesian information criterion (BIC) | Strong mathematical basis (Bayesian statistics) | Asymptotic properties may not hold for complex (multi-parameter) models | BIC is recommended mainly for simpler models, especially when overfitting is a concern, e.g., deciding the order of an AR process (see the sketch after this table) |
| | Easy to calculate | Difficult to test whether differences are significant | Consider simulations, as for AIC |
| Resampling methods | No assumptions on data distributions | CPU-intensive | Parametric bootstrap and cross-validation are often the best choice for testing models with few parameters |
| | Provides a | Does not always converge to the true model (statistically inconsistent in the M-closed case) | |
| **Clustering** | | | |
| Hierarchical clustering, agglomerative | Simple | CPU-intensive for large datasets | With careful choice of the similarity measure, clustering rule, and other parameters, the flexibility of hierarchical clustering can be used to its advantage; test the robustness of the results by exploring the parameter space |
| | Easy to interpret | Sensitive to outliers and to choices of algorithms and parameters | |
| Hierarchical clustering, divisive | Includes more robust and CPU-efficient options | Sensitive to choices of algorithms and parameters | |
| K-means clustering | CPU-efficient | Requires an a priori estimate of the number of clusters | Ideal choice if the number of expected clusters is known; explore the robustness of results by starting the algorithm from different sets of centroids |
| | Does not rely on many parameters | May converge to local minima and not find the global optimum | |
The last column provides suggestions on how to best use each method.
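The BIC recommendation for choosing the order of an AR or ARMA process (see Figure 1) can be illustrated with a short sketch. This assumes the statsmodels package (its ArmaProcess and ARIMA classes); it is not the authors' code, and the simulated coefficients and sample size are invented for the example.

```python
# Sketch of BIC-based ARMA(p, q) order selection (assumes statsmodels is installed).
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.arima.model import ARIMA

np.random.seed(2)
ar = np.array([1, -0.6, 0.2])   # AR lag polynomial: phi_1 = 0.6, phi_2 = -0.2
ma = np.array([1, 0.4])         # MA lag polynomial: theta_1 = 0.4
data = ArmaProcess(ar, ma).generate_sample(nsample=1000)   # simulate an ARMA(2, 1) series

best = None
for p in range(1, 5):
    for q in range(1, 5):
        fit = ARIMA(data, order=(p, 0, q)).fit()   # ARMA(p, q) fitted as ARIMA with d = 0
        if best is None or fit.bic < best[0]:
            best = (fit.bic, p, q)
print("Lowest BIC at (p, q) =", (best[1], best[2]))
```

Replacing `fit.bic` with `fit.aic` in the selection step reproduces the kind of discrepancy shown in Figure 1, where AIC may favor a more complex model than the one used for the simulation.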
Figure 2. Examples of clustering problems in neuroscience.
A, Examples of simplified quantitative measures of waveforms often used in spike sorting.
B, Examples of distance measures used for quantifying similarity between data points and clusters, often used in spike sorting. The Mahalanobis distance normalizes by the standard deviation across dimensions, while template matching approaches are based on waveform correlations with predefined waveform templates.
C, Hierarchical clustering of simulated neuronal activity (peri-event time histograms) of n = 50 neurons with three (seemingly) very well-separated groups (left). First, principal component analysis (PCA) was used to reduce the dimensionality of the time series data. Second, agglomerative and divisive hierarchical clustering were performed in the space spanned by the first two principal components (right). Agglomerative clustering separated a single outlier cell (magenta arrow) at an earlier step (the second) than the three main clusters, in contrast with divisive clustering. The two methods also differed in the clustering of a cell similar to more than one of the main groups (black arrow).
D, Clustering of human cells from multiple cortical areas based on RNA-sequencing data (n = 50,281 genes, publicly available at https://portal.brain-map.org/atlases-and-data/rnaseq/human-multiple-cortical-areas-smart-seq). Top, Trimmed mean expression of n = 20 marker genes. Middle, Agglomerative hierarchical clustering was performed based on the first 20 principal components, revealing the hierarchy of cell types (branches of the dendrogram were identified based on marker gene expression). Bottom, Soft K-means (k = 2) clustering was performed to assign to every cell the probabilities of belonging to each of two main cell types, identified as excitatory and inhibitory based on marker gene expression.
E, Spike sorting of simulated action potentials (n = 221) using K-means clustering. Left, We applied the elbow method to the average intra-cluster squared Euclidean distance of all points to find the optimal number of clusters (k = 4, black arrow), based on 100 repetitions of K-means clustering for each k in the range 1–10. Right, K-means clustering (k = 4) was performed with three different sets of initial centroid locations (black crosses; three of the four centroids were kept at fixed positions while one was changed), leading to surprisingly different clusters.
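The elbow method and repeated K-means restarts described for the last panel can be sketched as follows (an illustration assuming scikit-learn and made-up two-dimensional feature data, not the authors' spike-sorting pipeline).

```python
# Sketch of the elbow method with repeated K-means restarts (assumes scikit-learn).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# Hypothetical 2-D waveform features: four Gaussian clusters standing in for spike data.
centers = np.array([[0, 0], [4, 0], [0, 4], [4, 4]])
X = np.vstack([c + rng.normal(0, 0.5, (60, 2)) for c in centers])

# Elbow method: within-cluster sum of squares (inertia) as a function of k,
# with 100 random restarts per k to reduce the risk of local minima (Mine #8).
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=100, random_state=0).fit(X)
    print(f"k = {k:2d}: within-cluster sum of squares = {km.inertia_:.1f}")
```

The curve of inertia versus k should bend ("elbow") near k = 4 in this toy example; the bend, rather than the minimum, indicates the number of clusters, since inertia decreases monotonically with k.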