| Literature DB >> 32251439 |
Ashley E Tate, Ryan C McCabe, Henrik Larsson, Sebastian Lundström, Paul Lichtenstein, Ralf Kuja-Halkola.
Abstract
BACKGROUND: Predicting which children will go on to develop mental health symptoms as adolescents is critical for early intervention and preventing future, severe negative outcomes. Although many aspects of a child's life, personality, and symptoms have been flagged as indicators, there is currently no model created to screen the general population for the risk of developing mental health problems. Additionally, the advent of machine learning techniques represents an exciting way to potentially improve upon the standard prediction modelling technique, logistic regression. Therefore, we aimed to I.) develop a model that can predict mental health problems in mid-adolescence II.) investigate if machine learning techniques (random forest, support vector machines, neural network, and XGBoost) will outperform logistic regression.Entities:
Mesh:
Year: 2020 PMID: 32251439 PMCID: PMC7135284 DOI: 10.1371/journal.pone.0230389
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Descriptive information from the partitioned data.
| Data set | N | Birth year, mean (SD) | Sex, male % | SDQ cutoff reached % |
|---|---|---|---|---|
| Train set | 4554 | 1996.5 (1.69) | 48.4% | 12.1% |
| Tune set | 804 | 1996.3 (1.68) | 49.6% | 12.3% |
| Test set | 2280 | 1996.5 (1.68) | 48.1% | 11.5% |
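A minimal sketch of how a three-way split of roughly these proportions (about 60%/10%/30% of the 7,638 children) might be produced in R. The data frame name, outcome name, and proportions are illustrative assumptions, not taken from the paper's code.

```r
# Illustrative train/tune/test split; 'dat' is an assumed data frame with one
# row per child and the binary outcome 'sdq_cutoff' (0/1).
set.seed(2020)
n <- nrow(dat)
idx <- sample(seq_len(n))                 # shuffle row indices
n_train <- floor(0.60 * n)
n_tune  <- floor(0.10 * n)

train <- dat[idx[1:n_train], ]
tune  <- dat[idx[(n_train + 1):(n_train + n_tune)], ]
test  <- dat[idx[(n_train + n_tune + 1):n], ]
```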
Information on techniques (an illustrative R sketch of each fit follows the table).
| Technique | R package used | Description |
|---|---|---|
| Random Forest | randomForest | Decision trees are a model type that groups data into a tree-like structure based on if-then-else decisions. At each decision point (node), the data are split into smaller subgroups based on one of the predictor variables. Random forest aggregates the results of many decision trees, and the prediction is determined by the majority decision across trees. |
| XGBoost | xgboost | XGBoost, or extreme gradient boosting, applies gradient boosting to an ensemble of decision trees. Gradient boosting assigns scores to each leaf of a tree and builds new trees based on the performance of previously created trees, so trees receive varying weights. This is in contrast to random forest, where all trees are weighted equally. |
| Logistic Regression | Base R | Logistic regression represents the standard method in epidemiology for analyzing binary outcomes. |
| Neural Network | neuralnet | A neural network consists of numerous interconnected processors, or “neurons”, organized in multiple layers: input, hidden, and output. |
| Support Vector Machines | e1071 | Support vector machines work by dividing the classes, i.e., cases versus non-cases, with a boundary called a hyperplane. The hyperplane is chosen to maximize the distance to the nearest neighboring predictor data points of each class. Data of higher complexity that cannot be separated in the original dimensions can be lifted to a higher dimension through a process called kernelling. |
*mlr [42] was also used for all techniques.
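The paper states that the packages above were used via mlr; the exact code is not reproduced here, so the following is only a minimal sketch of how the five learners might be fit by calling the named packages directly. The data frame `train`, the outcome `sdq_cutoff`, and all tuning settings are assumptions for illustration.

```r
# Illustrative fits of the five learners on an assumed data frame 'train'
# with a binary outcome 'sdq_cutoff' coded 0/1; all settings are placeholders.
library(randomForest)
library(xgboost)
library(neuralnet)
library(e1071)

# Logistic regression (base R)
fit_lr <- glm(sdq_cutoff ~ ., data = train, family = binomial)

# Random forest: majority vote over many decision trees
fit_rf <- randomForest(factor(sdq_cutoff) ~ ., data = train,
                       ntree = 500, importance = TRUE)

# XGBoost: gradient-boosted trees, fed a numeric predictor matrix
x_train <- model.matrix(sdq_cutoff ~ . - 1, data = train)
fit_xgb <- xgboost(data = x_train, label = train$sdq_cutoff,
                   nrounds = 100, objective = "binary:logistic", verbose = 0)

# Neural network: neuralnet expects numeric inputs and the predictors
# spelled out in the formula; one hidden layer of 5 units is arbitrary
predictors <- setdiff(names(train), "sdq_cutoff")
nn_formula <- as.formula(paste("sdq_cutoff ~", paste(predictors, collapse = " + ")))
fit_nn <- neuralnet(nn_formula, data = train, hidden = 5, linear.output = FALSE)

# Support vector machine with a radial (kernelled) decision boundary
fit_svm <- svm(factor(sdq_cutoff) ~ ., data = train,
               kernel = "radial", probability = TRUE)
```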
Fig 1. Learning curve.
The learning curve shows the performance of each technique without any data or hyperparameter modification (y axis) against the percentage of the dataset used to train the models (x axis).
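One way such a learning curve could be produced (not necessarily the authors' procedure) is to refit a model on growing fractions of the training data and score it on the tune set each time. The sketch below does this for the random forest only and defines a simple rank-based AUC helper; all names are assumptions carried over from the earlier sketches.

```r
# Illustrative learning curve: refit on growing fractions of 'train'
# and score AUC on 'tune' (random forest only, for brevity).
library(randomForest)

auc_rank <- function(scores, labels) {
  # Mann-Whitney formulation of the AUC
  r <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(2020)
fractions <- seq(0.1, 1.0, by = 0.1)
curve_auc <- sapply(fractions, function(p) {
  idx  <- sample(nrow(train), size = floor(p * nrow(train)))
  fit  <- randomForest(factor(sdq_cutoff) ~ ., data = train[idx, ], ntree = 500)
  pred <- predict(fit, newdata = tune, type = "prob")[, "1"]
  auc_rank(pred, tune$sdq_cutoff)
})
plot(fractions, curve_auc, type = "b",
     xlab = "Fraction of training data", ylab = "AUC on tune set")
```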
Fig 2. AUC curves for tune set.
The AUC performance for each technique using the tune set.
Model performance on tune set.
| Learner | AUC | 95% bootstrap interval |
|---|---|---|
| Logistic Regression | 0.750 | 0.693–0.805 |
| XGBoost | 0.723 | 0.662–0.778 |
| Random Forest | 0.754 | 0.698–0.804 |
| Support Vector Machines | 0.754 | 0.701–0.802 |
| Neural Network | 0.715 | 0.658–0.769 |
Model performance on test set.
| Learner | AUC | 95% bootstrap interval |
|---|---|---|
| Logistic Regression | 0.700 | 0.665–0.734 |
| XGBoost | 0.692 | 0.660–0.723 |
| Random Forest | 0.739 | 0.708–0.769 |
| Support Vector Machines | 0.736 | 0.707–0.765 |
| Neural Network | 0.705 | 0.671–0.737 |
Fig 3. AUC curves for test set.
The AUC performance for each technique using the test set.
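The 95% bootstrap intervals in the two tables above could, for instance, be obtained by resampling the evaluation set with replacement and recomputing the AUC in each resample. The percentile-interval sketch below reuses the `auc_rank()` helper from the learning-curve sketch and assumes a vector of predicted probabilities `pred` aligned with the test set; it is an illustration, not the authors' exact procedure.

```r
# Percentile bootstrap interval for the AUC on an evaluation set;
# 'pred' and 'test$sdq_cutoff' are assumed to be aligned vectors.
set.seed(1)
boot_auc <- replicate(2000, {
  i <- sample(length(pred), replace = TRUE)   # resample cases with replacement
  auc_rank(pred[i], test$sdq_cutoff[i])
})
c(auc = auc_rank(pred, test$sdq_cutoff),
  quantile(boot_auc, c(0.025, 0.975)))        # point estimate and 95% interval
```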
Variable importance in random forest.
| Predictor (Source) | Importance |
|---|---|
| Oppositional Defiant symptoms | 136.97 |
| Impulsivity symptoms | 94.05 |
| Inattention symptoms | 92.66 |
| Executive dysfunction | 87.72 |
| Emotional symptoms | 76.82 |
| Neighborhood deprivation | 64.03 |
| Peer difficulty | 53.22 |
| Parity | 44.17 |
| Gestational age at birth | 43.71 |
| Separation anxiety | 43.13 |
1 Autism—Tics, AD/HD and other Comorbidities inventory [58]
2 The Longitudinal Integration Database for Health Insurance and Labor Market Studies [35]
3 Medical Birth Register [33]
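The importance scores above come from the random forest. With the randomForest package they could be extracted roughly as below (using `fit_rf` from the earlier sketch); whether the paper reports mean decrease in accuracy or in Gini impurity is not stated here, so the column choice is an assumption.

```r
# Extract and rank variable importance from the fitted random forest
# (fit_rf was trained with importance = TRUE in the earlier sketch).
imp <- randomForest::importance(fit_rf)   # matrix of importance measures
imp_sorted <- imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ]
head(imp_sorted, 10)                      # top 10 predictors
varImpPlot(fit_rf)                        # quick visual check
```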