| Literature DB >> 32551378 |
Frederico Cruz-Jesus, Mauro Castelli, Tiago Oliveira, Ricardo Mendes, Catarina Nunes, Mafalda Sa-Velho, Ana Rosa-Louro.
Abstract
Understanding academic achievement (AA) is one of the most pressing global challenges, as there is evidence that it is deeply intertwined with economic development, employment, and countries' wellbeing. However, research on this topic has typically been grounded in traditional (statistical) methods applied to survey (sample) data. This paper presents a novel approach, using state-of-the-art artificial intelligence (AI) techniques to predict the academic achievement of virtually every public high school student in Portugal, i.e., 110,627 students in the academic year of 2014/2015. Different AI and non-AI methods are developed and compared in terms of performance. Moreover, important insights for policymakers are addressed.
Keywords: Achievement; Applied computing; Artificial intelligence; Data analysis; Data science; Education; Education reform; Evaluation in education; Information systems; Quantitative research; Teaching research
Year: 2020 PMID: 32551378 PMCID: PMC7287246 DOI: 10.1016/j.heliyon.2020.e04081
Source DB: PubMed Journal: Heliyon ISSN: 2405-8440
Previous studies addressing academic achievement.
| References | Methods | Student | Parents | School |
|---|---|---|---|---|
| ( | Regression models | |||
| ( | Regression models | |||
| ( | General linear model | |||
| ( | Linear Programming techniques | |||
| ( | Frequency, Variance, and Structural models | |||
| ( | Regression models | |||
| ( | Hierarchical linear models | |||
| ( | Internet recorded | |||
| ( | Hierarchical linear model | |||
| ( | Item Response Theory; Regressions models | |||
| ( | Regression models | |||
| ( | Interviews | |||
| ( | Hierarchical linear models | |||
| ( | Hierarchical linear models; Regression models | |||
| ( | Hierarchical linear models; ANOVA tests | |||
| ( | Regression models | |||
| ( | Hierarchical linear models; Panel data models | |||
| ( | Tobit regression models; Univariate and Multivariate analyses | |||
| ( | Univariate analyses of variance; Chi-square tests | |||
| ( | Regression models | |||
| ( | Regression models | |||
| (S. | Regression model, Artificial Neural Networks, Radial Basis Function, and Support Vector Machines. | |||
| ( | Multiple group factor analytic models; Full maximum likelihood | |||
| ( | Descriptive statistics T-tests | |||
| ( | Regression models | |||
| ( | Regression discontinuity design; Control for school fixed effects; Regression models | |||
| ( | Probit regression; Regression models | |||
| ( | Hierarchical linear models | |||
| ( | Ordinary least squares | |||
| ( | Random Forests, decision trees, support vector machines and naïve Bayes | |||
| ( | Artificial neural networks | |||
Independent variables of the considered dataset. Variables represent demographic information, financial information of students’ families, and information about the school and the area in which the school is located.
| Variable | Description |
|---|---|
| x0 | Year of the study cycle |
| x1 | Portuguese citizenship (1 = Yes) |
| x2 | Portuguese naturality (1 = Yes) |
| x3 | Gender (1 = Female) |
| x4 | Student's age (years) |
| x5 | Number of enrolled years in high school |
| x6 | Number of failures in the educational career |
| x7 | Scholarship |
| x8 | Level of financial support received by government |
| x9 | Availability of a Personal Computer (PC) at home (1 = Yes) |
| x10 | Internet access (1 = Yes) |
| x11 | Class size (# students) |
| x12 | School size (# students) |
| x13 | Economic level of residence area |
| x14 | Population density of residence area |
| x15 | Rural residence area (1 = Rural) |
| x16 | Number of unit courses attended in the present academic year |
Figure 1. Boxplots of accuracy, recall, and AUROC for training and test instances for the ML techniques considered. Higher values correspond to better model performance.
P-values returned by the Wilcoxon test.
| | ANN | DT | ET | RF | SVM | KNN |
|---|---|---|---|---|---|---|
| ANN | - | <10^−8 | 3.49×10^−3 | 3.31×10^−6 | <10^−8 | <10^−8 |
| DT | | - | <10^−8 | <10^−8 | <10^−8 | 2.28×10^−2 |
| ET | | | - | <10^−8 | <10^−8 | <10^−8 |
| RF | | | | - | <10^−8 | <10^−8 |
| SVM | | | | | - | <10^−8 |
| KNN | | | | | | - |
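Pairwise comparisons like those in the table above can be produced with a Wilcoxon signed-rank test over paired per-run scores of two models. A minimal sketch, assuming repeated runs of each model are available (the score values below are illustrative, not the paper's data):

```python
# Hypothetical example: Wilcoxon signed-rank test comparing paired AUROC
# scores of two models over repeated runs (toy values, not the paper's data).
from scipy.stats import wilcoxon

auroc_rf  = [0.76, 0.75, 0.77, 0.76, 0.74, 0.76, 0.75, 0.77, 0.76, 0.75]
auroc_knn = [0.55, 0.56, 0.54, 0.55, 0.55, 0.56, 0.54, 0.55, 0.56, 0.55]

# The test uses the paired differences, so both lists must come from the
# same runs (same folds / same random seeds).
stat, p_value = wilcoxon(auroc_rf, auroc_knn)
print(f"p-value: {p_value:.4f}")
```

A small p-value, as throughout the table, indicates that the performance difference between the two models is unlikely to be due to chance.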
Accuracy, recall, and AUROC values on the test instances for the best model produced by the considered techniques.
| Test Set | ANN | DT | ET | RF | SVM | KNN | LR |
|---|---|---|---|---|---|---|---|
| Accuracy | 76.5% | 79.0% | 77.6% | 79.4% | 51.2% | 79.5% | 81.1% |
| Recall | 73.0% | 63.4% | 70.6% | 69.4% | 86.3% | 63.2% | 48.7% |
| AUROC | 0.75 | 0.73 | 0.75 | 0.76 | 0.65 | 0.55 | 0.55 |
Note that, for this problem, accuracy is a biased measure of performance because the two classes are imbalanced; recall and AUROC are therefore more informative.
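The three reported metrics can be computed with scikit-learn. A minimal sketch with toy stand-ins for the test instances (the labels, scores, and the 0.5 threshold below are illustrative assumptions, not the paper's data):

```python
# Toy example of the three reported metrics; 1 = positive class
# (e.g., an at-risk student), values are illustrative only.
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                   # ground-truth labels
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]   # predicted probabilities
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]    # assumed 0.5 threshold

acc    = accuracy_score(y_true, y_pred)   # fraction of correct predictions
recall = recall_score(y_true, y_pred)     # sensitivity on the positive class
auroc  = roc_auc_score(y_true, y_score)   # uses the raw scores, not labels
```

Note that AUROC is computed from the continuous scores, while accuracy and recall depend on the chosen decision threshold.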
Lift and captured response of the models.
| Test Set | ANN | DT | ET | RF | SVM | KNN | LR |
|---|---|---|---|---|---|---|---|
| Cumulative Lift at 5% | 4.26 | 4.59 | 3.54 | 4.65 | 2.45 | 3.42 | 2.59 |
| Cumulative Lift at 15% | 3.13 | 3.10 | 2.82 | 3.28 | 1.41 | 2.96 | 2.66 |
| Cumulative Captured Response 5% | 21% | 23% | 18% | 23% | 12% | 17% | 13% |
| Cumulative Captured Response 15% | 47% | 46% | 42% | 49% | 33% | 44% | 40% |
| Threshold 15% | 0.758 | 0.703 | 0.647 | 0.722 | 0.659 | 0.727 | 0.349 |
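Cumulative lift at a given depth is the share of all positives captured among the top-scored fraction of cases, divided by that fraction. A small sketch under these assumed definitions (the helper name and toy data are illustrative, not the paper's code):

```python
# Illustrative computation of cumulative lift and captured response at a
# given depth; definitions assumed, data is a toy example.
def cumulative_lift(y_true, y_score, depth):
    """Lift of the top `depth` fraction of cases ranked by score."""
    ranked = sorted(zip(y_score, y_true), reverse=True)   # highest scores first
    n_top = max(1, int(round(depth * len(ranked))))
    captured = sum(label for _, label in ranked[:n_top])  # positives in top slice
    captured_rate = captured / sum(y_true)                # "captured response"
    return captured_rate / depth, captured_rate

y_true  = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_score = [i / 20 for i in range(20, 0, -1)]   # descending toy scores
lift, captured = cumulative_lift(y_true, y_score, 0.15)
```

A lift above 1 means the model concentrates positives in its top-scored cases better than random selection; e.g., the RF's lift of 3.28 at 15% means its top-scored 15% contains 3.28 times as many positives as a random 15% sample would.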
Feature importance values for Random Forest, Decision Trees, and Extra Trees.
| Feature (RF) | Importance (RF) | Feature (DT) | Importance (DT) | Feature (ET) | Importance (ET) |
|---|---|---|---|---|---|
| Number of unit courses attended in the present academic year | 0.5300 | Number of unit courses attended in the present academic year | 0.5539 | Number of unit courses attended in the present academic year | 0.6770 |
| Student's age (years) | 0.1429 | School size (# students) | 0.1368 | Gender | 0.1167 |
| School size (# students) | 0.0924 | Economic level of residence area | 0.0770 | Student's age (years) | 0.0484 |
| Gender | 0.0737 | Gender | 0.0524 | Economic level of residence area | 0.0330 |
| Economic level of residence area | 0.0363 | Student's age (years) | 0.0506 | School size (# students) | 0.0277 |
Note: Feature importance can only be computed for the tree-based models.
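Tree-based models expose impurity-based importances that sum to 1 across features, which is how rankings like the one above are obtained. A minimal sketch with synthetic data (the dataset and feature names below are stand-ins, not the paper's variables):

```python
# Sketch of extracting impurity-based feature importances from a fitted
# random forest; synthetic data and stand-in feature names.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
names = ["courses_attended", "age", "school_size", "gender", "econ_level"]

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importances sum to 1; sort descending to get a ranking like the table's.
ranking = sorted(zip(names, rf.feature_importances_), key=lambda t: -t[1])
```

The same attribute exists on `DecisionTreeClassifier` and `ExtraTreesClassifier`, which is why the table reports all three; non-tree models such as SVMs or ANNs have no direct equivalent.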