R Iniesta, D Stahl, P McGuffin.
Abstract
Psychiatric research has entered the age of 'Big Data'. Datasets now routinely involve thousands of heterogeneous variables, including clinical, neuroimaging, genomic, proteomic, transcriptomic and other 'omic' measures. The analysis of these datasets is challenging, especially when the number of measurements exceeds the number of individuals, and may be further complicated by missing data for some subjects and by variables that are highly correlated. Statistical learning-based models are a natural extension of classical statistical approaches but provide more effective methods to analyse very large datasets. In addition, the predictive capability of such models promises to be useful in developing decision support systems, that is, methods that can be introduced to clinical settings to guide, for example, diagnosis classification or personalized treatment. In this review, we aim to outline the potential benefits of statistical learning methods in clinical research. We first introduce the concept of Big Data in different environments. We then describe how modern statistical learning models can be used in practice on Big Datasets to extract relevant information. Finally, we discuss the strengths of using statistical learning in psychiatric studies, from both research and practical clinical points of view.
Keywords: Machine learning; outcome prediction; personalized medicine; predictive modelling; statistical learning
Year: 2016 PMID: 27406289 PMCID: PMC4988262 DOI: 10.1017/S0033291716001367
Source DB: PubMed Journal: Psychol Med ISSN: 0033-2917 Impact factor: 7.723
Fig. 1. Main steps of the learning process.
Main properties of a set of selected statistical learning algorithms
| Machine learning algorithm | Details |
|---|---|
| General linear regression models (GLM) | A very simple approach based on specifying a linear combination of predictors to predict a dependent variable (Hastie …). Coefficients of the model are a measure of the strength of effect of each predictor on the outcome. Include linear and logistic regression models (Hosmer …). Prone to overfitting and multicollinearity in high-dimensional problems. |
| Elastic net models | Extension of general linear regression models (Zou, 2005). Explore a large number of predictors and keep the best set of variables for predicting the outcome. This internal feature selection avoids overly complex models and thus helps prevent overfitting. Strongly correlated predictors are selected (or discarded) together (known as a ‘grouping effect’). This is especially useful in exploratory research, where all of the predictors under study may turn out to be equally relevant and meaningful. Coefficients can be interpreted as in a general linear model. Lasso and ridge regression are particular cases (Tibshirani, 1994). |
| Naive Bayes | Family of simple classifiers based on applying Bayes’ theorem (Russell & Norvig, 2010) (see …). Assumes (naively) that features are independent of one another given the outcome. Gives the probability of each outcome value for unseen cases. |
| Classification and Regression Trees (CART) | A tree is a flowchart-like structure (Breiman, …). Allows modelling complex nonlinear relationships. Relatively fast to construct and produces interpretable models. Performs internal feature selection as an integral part of the procedure. |
| Random forest | Offers a rule to combine individual decision trees (Breiman, …). Multiple tree models are built using different randomly selected subsamples of the full dataset and different initial variables. They are then aggregated, and the most popular outcome value is selected by vote. Good for controlling overfitting and improving stability and accuracy. |
| Support vector machines (SVM) | Classifier that constructs hyperplanes in a multidimensional space to separate cases with different outcome values (Cortes & Vapnik, 1995; Scholkopf …). A new case is classified according to its position relative to the decision boundary. Allows modelling complex nonlinear relationships: a set of transformations called ‘kernels’ is used to map the data and make them linearly separable. Understanding the contribution of each predictor to outcome prediction is not straightforward and must be explored using specific methods (Altmann …). |
| Artificial neural networks (ANN) | A computer system that simulates the essential features of neurons and their interconnections, with the aim of processing information in the same way as real networks of neurons do (Ripley, 1996) (see …). A neuron receives inputs from other neurons through dendrites, processes them, and delivers an output through the axon. Connections between neurons are weighted during training. Input nodes are features and output nodes are outcomes; between them are hidden layers, each formed of a set of nodes. Allow modelling complex nonlinear relationships. Less likely to be used in medical research due to their lack of interpretability (…). |
All methods listed above can be used for classification (categorical outcome) and for regression (quantitative outcomes) problems. All of them can handle multiple continuous and categorical predictors.
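As a rough illustration of the statement in the table that lasso and ridge regression are particular cases of the elastic net, the sketch below computes the elastic net penalty under one common parameterization. The function name `elastic_net_penalty` and the parameters `lam` (overall penalty strength) and `alpha` (mixing weight between the two penalties) are assumptions made for this example and are not taken from the paper.

```python
def elastic_net_penalty(coefs, lam, alpha):
    """Elastic net penalty: lam * (alpha * L1 + (1 - alpha) / 2 * L2).

    alpha = 1 gives the lasso penalty, alpha = 0 gives the ridge penalty.
    This parameterization is an assumption; other conventions exist.
    """
    l1 = sum(abs(b) for b in coefs)      # sum of absolute coefficients (lasso part)
    l2 = sum(b * b for b in coefs)       # sum of squared coefficients (ridge part)
    return lam * (alpha * l1 + (1 - alpha) / 2 * l2)


# Hypothetical coefficient vector, used only for illustration
coefs = [1.0, -2.0, 0.0]
lasso_penalty = elastic_net_penalty(coefs, lam=0.5, alpha=1.0)
ridge_penalty = elastic_net_penalty(coefs, lam=0.5, alpha=0.0)
mixed_penalty = elastic_net_penalty(coefs, lam=0.5, alpha=0.5)
```

In a fitted model this penalty would be added to the loss (for example, the squared error or the negative log-likelihood), so larger values of `lam` shrink the coefficients more strongly.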
Fig. 2. (a) Data simulated from a follow-up study of major depression patients. Age of depression onset (years) and the MADRS score at baseline, ranging from 0 to 60 (0–6, normal; 7–19, mild depression; 20–34, moderate depression; >34, severe depression), are the predictor variables. The outcome is remission status at the end of the follow-up (YES or NO). (b) The Naive Bayes classifier is often represented as this type of graph. The direction of the arrows states that each class causes certain features, with a certain probability. (c) A hyperplane (a line, in dimension 2) is built at a maximal distance from each dashed line (this distance is called the margin). A new case (point) will be classified as remission or non-remission depending on its position relative to the line (also known as the decision boundary). (d) A simple decision tree suggesting that patients with an age of onset lower than 29 are more likely to reach remission. (e) Each node represents an artificial neuron and each arrow a connection from the output of one neuron to the input of another.
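To make panel (b) more concrete, the following is a minimal sketch of a Gaussian Naive Bayes classifier using the two predictors from panel (a). The data values, the function names `fit_gaussian_nb` and `predict_proba`, and the choice of a Gaussian distribution for each feature are all assumptions made for illustration; they are not the simulated data or the implementation used in the paper.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Small constant added so a constant feature does not give zero variance
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model


def predict_proba(model, x):
    """Posterior probability of each outcome value for one unseen case."""
    log_scores = {}
    for label, (prior, means, variances) in model.items():
        log_p = math.log(prior)
        # Naive assumption: features are independent given the class,
        # so the per-feature log-densities are simply added up
        for v, m, var in zip(x, means, variances):
            log_p += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        log_scores[label] = log_p
    top = max(log_scores.values())
    exp_scores = {k: math.exp(s - top) for k, s in log_scores.items()}
    total = sum(exp_scores.values())
    return {k: s / total for k, s in exp_scores.items()}


# Hypothetical cases: [age of onset (years), baseline MADRS score]
X = [[22, 18], [25, 21], [27, 15], [24, 25],
     [41, 36], [38, 30], [45, 33], [36, 39]]
y = ["YES", "YES", "YES", "YES", "NO", "NO", "NO", "NO"]

model = fit_gaussian_nb(X, y)
probs = predict_proba(model, [26, 20])   # probabilities of remission YES / NO
```

The returned dictionary gives one probability per outcome value, which matches the description of Naive Bayes as a method that gives the probability of each outcome for unseen cases.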
Glossary of statistical/machine learning terms used in this paper
| Term | Definition |
|---|---|
| Feature/attribute/predictor | A numerical (e.g. subset of real values) or categorical (i.e. a finite number of discrete values) value used as input to a learning algorithm |
| Outcome/response/label | A numerical or categorical value to predict from features |
| Labelled data | A set of features and labels for an observation |
| Training set | A collection of data used to train a learning algorithm |
| Test set | A collection of labelled data |
| Supervised learning | Techniques used to learn the relationship between independent attributes and a designated dependent attribute (the label) |
| Unsupervised learning | Learning techniques that group observations without a pre-specified dependent attribute. Clustering algorithms are usually unsupervised |
| Model | A structure and corresponding interpretation that summarizes or partially summarizes a set of data, for description or prediction. Most learning algorithms generate models that can then be used in a decision-making process |
| Accuracy | The rate of correct predictions made by the model over a dataset. Accuracy is usually estimated by using an independent test set that was not used at any time during the learning process. More complex accuracy estimation techniques, such as cross-validation and the bootstrap, are commonly used, especially with datasets containing a small number of observations |
| High-dimensional problem | Problems in which the number of features exceeds the number of observations |
| Overfitting | A modelling error that occurs when the model is fitted too closely to a limited set of data points. Because the data being studied often contain some degree of error or random noise, an overfitted model predicts new cases poorly |
| Multicollinearity | Correlation between features, i.e. the situation in which a change in the value of one feature is accompanied, to some degree, by changes in the values of the other features. When there is multicollinearity between variables in a regression model, its coefficients can become poorly determined and exhibit high variance |
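| Cross-validation | A method for estimating the accuracy (or error) of a learning algorithm by dividing the data into folds, each of which is used in turn as test data while the remaining folds are used for training |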
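The glossary entries for training set, test set and accuracy can be tied together with a short sketch. It holds out part of a small labelled dataset, builds a deliberately simple threshold model on the training part, and estimates accuracy on the held-out part. The data, the helper names `train_test_split`, `accuracy` and `predict`, and the threshold rule are hypothetical and serve only as an illustration.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and hold out a fraction as an independent test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(y_true, y_pred):
    """Rate of correct predictions over a dataset."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labelled data: (baseline MADRS score, remission label)
data = [(12, "YES"), (15, "YES"), (18, "YES"), (22, "YES"),
        (28, "NO"), (31, "NO"), (35, "NO"), (40, "NO")]
train, test = train_test_split(data, test_fraction=0.25, seed=0)

# Toy model: predict remission when the score is below the training-set mean
threshold = sum(score for score, _ in train) / len(train)


def predict(score):
    return "YES" if score < threshold else "NO"


test_accuracy = accuracy([label for _, label in test],
                         [predict(score) for score, _ in test])
```

Because the test cases play no part in choosing `threshold`, `test_accuracy` estimates how the model would behave on new cases, whereas accuracy computed on the training set would tend to be optimistic for an overfitted model.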
Fig. 3. Example of a 5-fold cross-validation. Data are randomly split into five folds of equal size. At every step, one fold is selected as the test dataset and the remaining four are used as training data. This procedure is repeated five times, selecting a different fold as test data at every step.
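The procedure in Fig. 3 can be sketched as a generator of index splits. The function name `k_fold_splits`, the `seed` parameter and the choice of 20 cases are assumptions made for this example.

```python
import random


def k_fold_splits(n_cases, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)        # random assignment of cases to folds
    folds = [indices[i::k] for i in range(k)]   # k folds of (nearly) equal size
    for i in range(k):
        test_idx = folds[i]                     # one fold serves as test data
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# 20 hypothetical cases split into 5 folds: 16 training and 4 test cases per step
splits = list(k_fold_splits(n_cases=20, k=5))
```

In practice a model would be fitted on each `train_idx`, scored on the matching `test_idx`, and the five scores averaged to give the cross-validated estimate of accuracy.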